# L02-E4-Filtering
## Exercise Instructions

* Complete all cells as instructed, replacing any ??? with the appropriate code

* Execute Jupyter **Kernel** > **Restart & Run All** and ensure that all code blocks run without error


# Subsets

Subsets give our analysis focus, whether that means removing outliers or irrelevant data, or drilling down into a specific category. Filtering rows and columns is a routine part of data analysis. Fortunately, this is an area of strength for the tidyverse, which has a package dedicated to it: dplyr.

We will start simple with select(), filter() and arrange(). 
* select() chooses the columns you want to keep and optionally renames them.
* filter() uses logical expressions to determine which rows are kept and which are discarded.
* arrange() sorts the rows based on the columns you give it.

Simply put, select() discards columns, filter() discards rows, and arrange() sorts rows. For those familiar with SQL, these three functions are similar to the SQL SELECT, WHERE, and ORDER BY clauses.
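
As a rough illustration of the SQL analogy, here is a minimal sketch on a made-up `pets` data frame. The data and variable names are assumptions for illustration only and are not part of the exercise.

```r
library(dplyr)

# Hypothetical toy data, not part of the exercise
pets <- tibble(
  name   = c("Rex", "Tom", "Ava", "Kit"),
  kind   = c("dog", "cat", "dog", "cat"),
  weight = c(30, 4, 22, 5)
)

# Like SQL SELECT name, weight: keep only two columns
name_weight <- select(pets, name, weight)

# Like SQL WHERE kind = 'dog': keep only the matching rows
dogs <- filter(pets, kind == "dog")

# Like SQL ORDER BY weight: sort rows, lightest first
by_weight <- arrange(pets, weight)

name_weight
dogs
by_weight
```

None of these calls changes `pets` itself; each returns a new data frame.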

We will also use two base R functions, head() and tail().
head() returns the first 6 rows of a data set and tail() the last 6. Six is the default value; you can override it by passing the number of rows you want. Combining head() or tail() with arrange() returns the top or bottom rows that matter most for your analysis.

We will look at top_n() and compare it to head() or tail() with arrange() as an alternative to subsetting the top or bottom portion of the data. 


## R Features
* library()
* glimpse()
* select()
* <- variable assignment operator
* head()
* tail()
* arrange()
* desc()
* labs()
* filter()
* ggplot()
* geom_jitter()
* geom_smooth()
* facet_wrap()
* facet_grid()

## Datasets
* mpg

# Load libraries

In [None]:
??? (tidyverse)

# Explore data's structure

In [None]:
# Dataframe: mpg
# Hint: glimpse()
??? (mpg)

# Help on select()
Let's look at the usage info from help for select()

In [None]:
? select

# select {dplyr}	R Documentation
Select/rename variables by name
## Description
select() keeps only the variables you mention; rename() keeps all variables.

# Using select()
Let's start by selecting only the variables we need for plotting.

Use the form: 

select(dataframe_name, column1, column2, column3, column4)

In [None]:
# Select columns: hwy, displ, cyl, and class
select(mpg, ??? , ??? , ??? , ??? )

After running this, notice that the columns in the output match the select list. We really need to store this result somewhere so we can use it later.

# Store result in a variable
Store the result from select(). Let's use the generic df variable name. I use this a lot for temporary variables and inside functions. df stands for data frame. We will use the two-character assignment operator '<-'.

In [None]:
# Select columns: hwy, displ, cyl, class
# Store in variable: df
??? <- select(mpg, ??? )

What happens when you run this? Was there any output?

# Explore the data's structure
The block above produced no output, so there was nothing to see. But we can still look at what is in this data frame.

Do you recall the function?

In [None]:
# Hint: glimpse()
???

Notice a familiar output, but of just 4 variables.

Note that the number of rows / observations, 234, has not changed.

Troubleshooting tip: If you get the output 'function (x, df1, df2, ncp, log = FALSE)', then you haven't successfully saved the selected mpg columns into the df variable. df is also the name of a built-in R function (the F distribution density in the stats package) that we likely won't use, so we are going to repurpose df as our working variable name for our data.
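
The name clash behind this tip can be sketched as follows. The one-column data frame is a made-up stand-in for illustration, not the exercise data.

```r
# Before any assignment, the name df refers to a function in the stats package
is.function(stats::df)

# Assigning to df creates a variable in your workspace with the same name
df <- data.frame(x = 1:3)   # hypothetical stand-in for the selected columns
is.data.frame(df)

# The original function is still reachable through its namespace
stats::df(1, df1 = 2, df2 = 3)
```

So if typing `df` prints a function definition, the assignment cell has not been run successfully yet.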

# head() and tail() 
There are two other functions that are helpful for looking at the data, head() and tail()

Let's review the help on these.

In [None]:
# Help on head
? head

In [None]:
# Help on tail
? tail

Notice that both head and tail point to the same help page where both are described. This is pretty common with related functions. Feel free to close the help window if desired. 

# head {utils}	R Documentation
Return the First or Last Part of an Object
## Description
Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.

# Using head() and tail()
Now run both head() and tail(), passing in our data frame variable.

In [None]:
# Use head()
??? (df)

# Use tail()
??? (df)

You can see two tables. The first is the top 6 rows, aka head(). The second is the bottom 6 rows, aka tail().

# Altering the number of rows in head() and tail()
What if we wanted a different number of rows returned? We can pass a number to head() and tail() representing the number of rows we want them to return. This is the n parameter.

How could you find the parameter name? It is described in the function's help documentation.
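
A minimal sketch of the n parameter on a made-up 10-row data frame (the `toy` data and variable names are assumptions for illustration only):

```r
# Hypothetical toy data frame with 10 rows
toy <- data.frame(id = 1:10, letter = letters[1:10])

# With no n given, both functions fall back to the default of 6 rows
default_head <- head(toy)
default_tail <- tail(toy)

# Naming n overrides the default
first_three <- head(toy, n = 3)
last_two    <- tail(toy, n = 2)

first_three
last_two
```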

In [None]:
# Return the top 2 rows
head(df, n = ??? )

# Return the bottom 3 rows
tail(df, n = ??? )

# Using positional arguments with head()
Would head(df, 2) work without the 'n = '?

In [None]:
# Return the first 7 rows using parameter position instead of name
??? (df, 7)

Running head() and tail() on a data frame helps you understand the data. I prefer glimpse() for this, though, because it includes the data types as well.

# Help on arrange()
What would be nice is to sort the data first and then look at the top or bottom of the sorted result. For sorting, dplyr provides arrange().

Let's pull up the help on arrange(). Note that we need to prefix it with dplyr::, so it will look like dplyr::arrange, in order to get the dplyr help page and not the plyr help page.

In [None]:
? dplyr::arrange

# arrange {dplyr}	R Documentation
Arrange rows by variables
## Description
Use desc() to sort a variable in descending order.

As you may be able to tell, arrange() does not expect you to name the parameters; it just takes a comma-separated list of column names. Also note that you can use desc(column_name) to change from ascending to descending order.

You can provide multiple column names like arrange(data_frame_name, column1, column2, desc(column3)) to order by multiple columns. 
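
A minimal sketch of sorting by more than one column, using a made-up `scores` data frame (the data and column names are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data, not part of the exercise
scores <- tibble(
  team   = c("b", "a", "b", "a"),
  player = c("p1", "p2", "p3", "p4"),
  points = c(10, 7, 15, 9)
)

# Sort by team ascending, then by points descending within each team
sorted <- arrange(scores, team, desc(points))
sorted
```

The first column listed is the primary sort key; later columns only break ties among rows that share the earlier values.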

# Using arrange()
Let's take our smaller data frame and find the vehicles with the worst highway miles per gallon. 

In [None]:
# arrange df by lowest hwy
??? (df, hwy)

I would hate to own the 12 mpg pickup with today's gas prices!! Consider an electric vehicle!!

# Combining head() and arrange()
Let's combine head and arrange to get the top 6 worst gas guzzlers. 

To do this you will need to wrap the arrange function inside the head function. The innermost function is processed first and it works its way out.

It takes the form:

outer_function(inner_function (something))

In [None]:
# Display the top 6 worst gas guzzlers
# Hint: head(), arrange()
??? ( ??? (df, hwy))

# Combining functions with parameters
What if you want to pass parameters to these functions, such as changing the n parameter of head()?

Where does the n parameter go?

Positional argument function nesting takes the form:

outer_function(inner_function(inner_parameter1, inner_parameter2), outer_parameter2)

Where is outer_parameter1? It is the return value of inner_function. This is the positional way to specify parameters. It is also possible to specify the parameter name. This is the recommended way to make the code more readable when there is a lot of function nesting. 

Named argument function nesting takes the form:

outer_function(outer_arg1 = inner_function(inner_arg1 = inner_parameter1, inner_arg2 = inner_parameter2), outer_arg2 = outer_parameter2)
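
As a sketch of the two forms, the calls below should give the same result on a made-up `toy` data frame (an assumption for illustration only). The argument names x and n belong to head(), and .data is the first argument of arrange(); the column to sort by is passed unnamed.

```r
library(dplyr)

# Hypothetical toy data, not part of the exercise
toy <- tibble(id = c(3, 1, 2), label = c("c", "a", "b"))

# Positional form: the result of arrange() becomes the first argument of head()
positional <- head(arrange(toy, id), 2)

# Named form: each argument is labeled with its parameter name
named <- head(x = arrange(.data = toy, id), n = 2)

identical(positional, named)
```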

In [None]:
# Display the top 2 worst hwy vehicles 
# Hint: head(), arrange()
head(arrange(df, ??? ), n = ??? )

In [None]:
# How about the highest hwy
# Hint: tail()
??? (arrange(df, hwy))

It worked, but the highest is at the bottom of the list, not the top. Let's fix this next.

In [None]:
# Get the top 10 highest hwy vehicles
# Hint: head(), arrange(), desc()
head(arrange(df, ??? (hwy)), n = ??? )

Much better!

In [None]:
# Let's order by more than one column
# Order by class, then by highest hwy
# Return top 10
head(arrange(df, ??? , desc(hwy)), n = 10)

We can see the highest 2seater hwy.

# Help on filter()

Moving to filter, let's check the help. I like to scroll down and look at the examples.

We need to use dplyr::filter to get the dplyr help page rather than the help for a different filter() from another package, such as stats::filter.

In [None]:
? dplyr::filter

# filter {dplyr}	R Documentation
Return rows with matching conditions
## Description
Use filter() find rows/cases where conditions are true. Unlike base subsetting, rows where the condition evaluates to NA are dropped.

# Using filter()
The filter function takes a comma-separated list of boolean expressions, meaning that each expression evaluates to TRUE, FALSE, or NA. The expressions in the list are implicitly ANDed together. The comparison operators are <, <=, ==, >=, >, and !=. The last one means 'not equal'; unlike SQL, R has no <> operator. Note that equality uses two equal signs, also called a double equal.

Within each expression, there can be sub-expressions grouped with nested parentheses, and those can be ANDed using '&' or ORed using '|'.
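
A minimal sketch of these rules on a made-up `cars_toy` data frame (the data and names are assumptions for illustration only). It includes one NA so you can see that rows where a condition evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data, not part of the exercise
cars_toy <- tibble(
  model = c("m1", "m2", "m3", "m4", "m5"),
  cyl   = c(4, 6, 8, 4, NA),
  hwy   = c(30, 25, 17, 35, 28)
)

# Comma-separated expressions are ANDed together, same as using &
both_comma <- filter(cars_toy, cyl == 4, hwy > 32)
both_amp   <- filter(cars_toy, cyl == 4 & hwy > 32)

# | keeps rows where at least one side is TRUE
either <- filter(cars_toy, cyl == 8 | hwy > 32)

# Parentheses group sub-expressions before they are combined
grouped <- filter(cars_toy, (cyl == 4 | cyl == 8) & hwy < 32)

# For m5, cyl != 6 evaluates to NA, so that row is dropped
not_six <- filter(cars_toy, cyl != 6)

either
grouped
not_six
```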


In [None]:
# Let's find those 5 cylinder vehicles
# Use the double equal
filter(df, cyl ??? 5)

There is no question now that there are 4 vehicles with 5 cylinders in this dataset. Is this accurate or is this a data quality issue?

# Create and plot a filtered dataframe
Let's create a filtered data frame removing 5 cylinder and then plot it.

In [None]:
# Name: df_filtered
# Filter: remove 5 cylinder rows
df_filtered <- filter(df, cyl ??? 5)  # use != for not equal; R has no <> operator

# Plot: scatterplot x = displ, y = hwy; facet by cyl
ggplot(df_filtered, mapping = aes( ??? )) +
   geom_point() +
   facet_wrap( ??? )

It looks better without the 5-cylinder panel. Were you spoiled with the +jitter, +alpha, and +smooth? I was!

# Filtered plot with some polish
You can use the df_filtered variable from the cell above in this cell. Everything runs in the same R kernel, so it remembers everything regardless of where the code is. You just need to run a cell for the R kernel to know about it.

Using the existing df_filtered data frame and the previous plot parameters, add +jitter, +alpha (0.5), and +smooth (linear) with no confidence interval bands. Facet by class and cyl. Connect class to color.

In [None]:
# Plot: scatterplot x = displ, y = hwy, color = class; 
# facet class (rows) by cyl (columns)
# Add jitter, alpha, and linear trend line and title
ggplot(df_filtered, mapping = aes(x = displ, y = hwy, color = class)) +
   ??? (alpha = 0.5) +
   ??? (method = "lm", se = FALSE) +
   ??? ( class ~ cyl ) + 
   labs(title = "mpg trend: displ vs hwy by class and cyl",
       subtitle = "Points jittered, alpha blended, colored by class, and faceted by class and cyl")

That is what I am used to.

# Plotting a subset of the data
What would happen if we only plotted the first 10 rows?

In [None]:
# Plot the first 10 rows of the prior plot
ggplot( ??? (df_filtered, n = ??? ), mapping = aes(x = displ, y = hwy, color = class)) +
   geom_jitter(alpha = 0.5) +
   geom_smooth(method = "lm", se = FALSE) +
   facet_grid( class ~ cyl ) + 
   labs(title = "mpg trend: displ vs hwy by class and cyl",
       subtitle = "First 10 rows +jitter, +alpha , colored by class, facet class x cyl")

Was df_filtered even sorted??

# Plot top 10 results
Let's plot the top 10 highest hwy of the previous plot.

In [None]:
# Plot the top 10 highest hwy of the prior plot
ggplot(head( ??? (df_filtered, ??? (hwy)), n = 10), mapping = aes(x = displ, y = hwy, color = class)) +
   geom_jitter(alpha = 0.5) +
   geom_smooth(method = "lm", se = FALSE) +
   facet_grid( class ~ cyl ) + 
   labs(title = "mpg trend: displ vs hwy by class and cyl",
       subtitle = "Top 10 hwy vehicles, +jitter, +alpha , colored by class, facet class x cyl")

Where did the other cylinder columns go?

By default, facets drop categories that have no data, but you can change that if you like. See the drop argument in the facet_grid() help, which applies to unused factor levels.

Notice how the nested functions are getting a bit hard to manage. Function nesting can get confusing. You will learn another method later in the course called pipelining that will convert the function nesting into a serial pipeline that can make the code more readable. Pipelining is my preferred method, but it only works in certain situations so there will still be a need for function nesting.
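
As a brief preview only, here is a sketch of the same steps written both ways on a made-up `toy` data frame (an assumption for illustration). It uses the `%>%` pipe operator that dplyr makes available; the details are covered later in the course.

```r
library(dplyr)

# Hypothetical toy data, not part of the exercise
toy <- tibble(id = 1:5, value = c(10, 50, 30, 20, 40))

# Nested form: read from the inside out
nested <- head(arrange(toy, desc(value)), n = 2)

# Pipelined form: read from left to right, one step at a time
piped <- toy %>% arrange(desc(value)) %>% head(n = 2)

identical(nested, piped)
```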

# Code Summary
Let's summarize the code we just did.

In [None]:
# Load libraries
???

# Explore data structure
# Data: mpg
???

# Select hwy, displ, cyl, class
# Store in df
df <- select(mpg, ???)

# Last 5 rows of df
tail(df, n = ??? )

# Name: df_filtered
# Filter: remove 5 cylinder rows
df_filtered <- filter(df, ??? )  # use != for not equal; R has no <> operator

# Store top 10 hwy 
df_top_hwy <- head(arrange(df_filtered, ??? ), n = 10)

# Plot using above variable
# x = displ, y = hwy, color = class; 
# facet class (rows) by cyl (columns)
# Add jitter, alpha, and linear trend line and title
ggplot( ??? , mapping = aes( ??? )) +
   geom_jitter(alpha = 0.5) +
   geom_smooth( ??? ) +
   facet_grid( ??? ) + 
   labs(title = "mpg trend: displ vs hwy by class and cyl",
       subtitle = "Top 10 hwy vehicles, +jitter, +alpha , colored by class, facet class x cyl")