# L02-E6-Select
## Exercise Instructions

* Complete all cells as instructed, replacing any ??? with the appropriate code

* Execute Jupyter **Kernel** > **Restart & Run All** and ensure that all code blocks run without error

# Selecting Columns

The select() function is simple: you provide a comma-separated list of columns, and only those columns are returned in a data frame, in that order. We explored this basic operation in a prior exercise. Practice makes perfect, so let's revisit select() and work with some of its helper functions to pick out the columns we are looking for.

Often we select columns not just to drop the ones we don't need, but to isolate the ones that need additional data wrangling or cleansing, such as data type conversion. This makes selecting columns a much more common task than just part of the data import process.

In this exercise, we will explore some additional time-saving features select() provides. 
Admittedly, the dataset we are working with has only a handful of columns. As a result, there are some functions we won't be able to cover, and all this might seem a bit trivial. In practice, however, data frames can have hundreds or thousands of columns, and managing these manually would be painful. Often, these wide data frames have patterns in their column names: maybe they end with a 3-digit number, or maybe the date columns end with _date. We can exploit these patterns to select the columns of interest programmatically. After all, we are in a programming environment, so let's take advantage of it.
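As a minimal sketch of exploiting naming patterns like these, the example below builds a small hypothetical data frame and selects columns by suffix and by a trailing 3-digit number. The data frame `orders` and all of its column names are invented for illustration; they are not part of the mpg dataset.

```r
library(dplyr)

# Hypothetical wide data frame; the column names are made up for illustration
orders <- data.frame(
  id         = 1:2,
  order_date = c("2020-01-01", "2020-01-02"),
  ship_date  = c("2020-01-03", "2020-01-05"),
  score_001  = c(0.1, 0.2),
  score_002  = c(0.3, 0.4),
  stringsAsFactors = FALSE
)

# Columns whose names end with "_date"
date_cols <- orders %>% select(ends_with("_date")) %>% names()

# Columns whose names end with three digits (a regular expression)
numbered_cols <- orders %>% select(matches("[0-9]{3}$")) %>% names()

print(date_cols)
print(numbered_cols)
```

The same two lines would work unchanged if the data frame had hundreds of such columns, which is the point of selecting by pattern rather than by name.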

We will also use regular expressions. Regular expressions are a popular method for text pattern matching, implemented in many programming languages. Full coverage is outside the scope of this course; they are used briefly in this exercise to make you aware of them, and they will show up more later in the course. Just know that R makes extensive use of regular expressions, and they are the default for most text matching functions. So be careful when trying to match symbols, as they may be interpreted as regular expression metacharacters.
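The warning about symbols can be illustrated with base R's grepl(), which treats its pattern as a regular expression by default. This is only a sketch; the vector `col_names` is a hypothetical set of names chosen to include a literal dot.

```r
# Hypothetical column names, one of which contains a literal dot
col_names <- c("a.b", "ab", "cd")

# "." is a wildcard in a regular expression, so every non-empty name matches
wildcard <- col_names[grepl(".", col_names)]

# Escaping the dot ("\\.") matches only names that contain a literal dot
escaped <- col_names[grepl("\\.", col_names)]

# fixed = TRUE turns off regular expressions and matches the text literally
literal <- col_names[grepl(".", col_names, fixed = TRUE)]

print(wildcard)
print(escaped)
print(literal)
```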


## R Features
* library()
* glimpse()
* select()
* select_if()
* ? help
* %>% pipe
* names()
* sort()
* rename()
* \- exclude
* : range
* starts_with(): starts with a prefix
* ends_with(): ends with a suffix
* contains(): contains a literal string
* matches(): matches a regular expression
* everything(): all variables
* regular expressions
* is.character()
* is.integer()
* is.double()
* is.numeric()
* is.logical()


## Datasets
* mpg

In [1]:
# Load libraries
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [3]:
# Explore data structure
# Data: mpg
mpg %>% glimpse()

Observations: 234
Variables: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
$ fl           <c

In [4]:
# Display the select() help
?select

# select {dplyr}	R Documentation
Select/rename variables by name
## Description
select() keeps only the variables you mention; rename() keeps all variables.

## Usage
select(.data, ...)

rename(.data, ...)

In [5]:
# select columns with comma delimited list
# Only display the column names instead of glimpse to reduce output
# select columns: hwy, displ, cyl, class
# Hint: select(), names()
mpg %>% 
   select(hwy, displ, cyl, class) %>% 
   names()

Notice that it simply returns the list of column names.

# sort()
Sometimes for longer lists of columns, I like to see them in alphabetical order. sort() can help with this.

In [6]:
# sort column names
# Hint: sort()
mpg %>% 
   names() %>% 
   sort()

Notice that the column names are listed in alphabetical order. The order in the data frame didn't change; only the displayed names were sorted.

# Select all columns except some
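Use - (minus) to remove columns. To keep everything except the named columns, put the exclusions first: when the first argument to select() is an exclusion, select() starts from all columns and drops the ones you name.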

In [7]:
# select all columns except trans and fl
mpg %>%
   select(- fl, - trans) %>%
   names()
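Why the position of an exclusion matters can be seen in a small sketch. The data frame `df` is hypothetical and exists only to show the two starting points select() can use.

```r
library(dplyr)

# Hypothetical four-column data frame
df <- data.frame(a = 1, b = 2, c = 3, d = 4)

# Exclusion first: start from all columns, then drop a and b
drop_only <- df %>% select(-a, -b) %>% names()

# Inclusion first: start from b through d only, then drop c
keep_then_drop <- df %>% select(b:d, -c) %>% names()

print(drop_only)
print(keep_then_drop)
```

In the second call, column a is absent even though it was never excluded, because the selection started from b:d rather than from all columns.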

# Select columns by range
Use <column name>:<column name> for start and end range.

In [8]:
# select all columns inclusively between model and hwy
mpg %>% 
   select( model : hwy ) %>% 
   names()

# Combine different selection methods

In [12]:
# select all columns inclusively between displ and fl 
# Exclude trans 
# With the last column being manufacturer
# Hint: if you are selecting columns and removing columns then put the column removal at the end 
mpg %>% 
   select(displ : fl , manufacturer, - trans) %>% 
   names()

# Rename columns while selecting them
Pattern is: new_name = current_name

In [14]:
# select displ and rename it displacement
mpg %>% 
   select( displacement = displ ) %>% 
   names()

Notice that the column was renamed, but it was also the only column selected. If we want all the columns, and rename some of them, then the rename() function is more appropriate for this task. 

# rename()
Select all the columns and rename some. rename() is simpler when renaming columns and selecting all columns.

In [16]:
# Rename displ to displacement yet include all columns
# Hint: rename()
mpg %>% 
   rename( displacement = displ ) %>% 
   names()

Notice that rename doesn't change the order of the columns or drop any columns.

In [17]:
# Display help for starts_with() which displays all select helper functions
? starts_with

# select_helpers {dplyr}	R Documentation
Select helpers
## Description
These functions allow you to select variables based on their names.

starts_with(): starts with a prefix

ends_with(): ends with a suffix

contains(): contains a literal string

matches(): matches a regular expression

num_range(): a numerical range like x01, x02, x03.

one_of(): variables in character vector.

everything(): all variables.

## Usage
current_vars()

starts_with(match, ignore.case = TRUE, vars = current_vars())

ends_with(match, ignore.case = TRUE, vars = current_vars())

contains(match, ignore.case = TRUE, vars = current_vars())

matches(match, ignore.case = TRUE, vars = current_vars())

num_range(prefix, range, width = NULL, vars = current_vars())

one_of(..., vars = current_vars())

everything(vars = current_vars())
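num_range() and one_of() do not come up in the mpg exercises, so here is a minimal sketch of both. The data frame `df`, its column names, and the vector `wanted` are hypothetical, chosen to mimic the x01, x02, x03 pattern mentioned in the help text.

```r
library(dplyr)

# Hypothetical data frame with zero-padded, numbered columns
df <- data.frame(id = 1, x01 = 2, x02 = 3, x03 = 4, note = "a",
                 stringsAsFactors = FALSE)

# num_range(): builds the names x01 and x02 from the prefix "x",
# the numbers 1:2, and a zero-padded width of 2
ranged <- df %>% select(num_range("x", 1:2, width = 2)) %>% names()

# one_of(): selects the columns named in a character vector
wanted <- c("id", "note")
picked <- df %>% select(one_of(wanted)) %>% names()

print(ranged)
print(picked)
```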

# starts_with()
This can help select columns using string literal matching starting at the beginning of the column name.

In [18]:
# Select all columns that start with "m"
# Hint: starts_with()
mpg %>% 
   select( starts_with("m")) %>% 
   names()

# ends_with()
Just like starts_with() but matches the end of the column name instead of the start.

In [19]:
# Select columns that end with 'y'
# Hint: ends_with()
mpg %>% 
   select( ends_with("y")) %>% 
   names()

# contains()
This will match the string literal anywhere in the column name: at the start, at the end, or anywhere in the middle.

In [20]:
# Select columns that contain "an"
# Hint: contains()
mpg %>% 
   select( contains("an")) %>% 
   names()

# matches()
Select columns that match a text pattern. matches() uses regular expressions, a pattern language that allows wildcard matches. 

A . (dot) matches any character   

A * (asterisk) means 0 or more occurrences of the preceding character

So .\* means any number of characters, including none. This lets you combine two patterns, such as a start and an end, with anything in the middle. The .\* accounts for the 'anything in the middle'.

In [21]:
# Select columns that match an "a" and then an "s" later in the string
# Hint: matches()
mpg %>% 
   select( matches("a.*s")) %>% 
   names()

In regular expressions, the '.' means any character and the '\*' modifies the preceding pattern element to match 0 or more occurrences. So "a.\*s" means: look for an a, followed by 0 or more of any character, followed by an s.
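To see the pattern at work outside of select(), the sketch below applies it to the mpg column names with base R. The names are typed out by hand (an assumption made so the block needs no packages), and regmatches() is used only to show which part of each name the pattern covers.

```r
# Column names of mpg, typed out by hand so the sketch needs no packages
cols <- c("manufacturer", "model", "displ", "year", "cyl", "trans",
          "drv", "cty", "hwy", "fl", "class")

# Which names contain an "a" followed, somewhere later, by an "s"?
hits <- cols[grepl("a.*s", cols)]

# The part of each matching name that the pattern actually covers
covered <- regmatches(hits, regexpr("a.*s", hits))

# ".*" may also match zero characters, so "as" on its own is a match
zero_between <- grepl("a.*s", "as")

print(hits)
print(covered)
print(zero_between)
```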

# everything()
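Use everything() to move one column to the front. everything() selects all columns, and a column that has already been selected keeps its position, so listing a column first and then everything() moves that column to the front and leaves the rest in their original order.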

In [22]:
# Select all columns with class ordered to be the first column
# Don't change the order of any other column
# Hint: everything()
mpg %>% 
   select(class, everything()) %>%
   names()

# select_if()
There are a number of suffixes that can be added to the select function to extend its versatility. Look for these suffixes in other tidyverse functions as well. 

A useful suffix is 'if'. select_if() will return the columns for which the specified function returns TRUE. If I want to select all of the columns that are of type character, I can use the is.character function, which returns TRUE if a column is of type character and FALSE otherwise. 

When a function is used in select_if(), the function's parentheses are left off. For example: 

select_if(dataframe, is.character) will return the character columns of that dataframe.
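A minimal sketch of this, using a hypothetical data frame `df` with one column of each common type. The last line applies the same predicate by hand with sapply(), as a rough picture of what select_if() evaluates for each column.

```r
library(dplyr)

# Hypothetical data frame with one column of each common type
df <- data.frame(
  name  = c("a", "b"),
  count = c(1L, 2L),
  ratio = c(0.5, 1.5),
  flag  = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Pass the predicate by name, without parentheses
chr_cols <- df %>% select_if(is.character) %>% names()

# is.numeric is TRUE for both integer and double columns
num_cols <- df %>% select_if(is.numeric) %>% names()

# The same predicate applied to each column by hand
by_hand <- names(df)[sapply(df, is.numeric)]

print(chr_cols)
print(num_cols)
print(by_hand)
```

Writing is.character() with parentheses would call the function immediately with no argument and fail; passing the bare name hands the function itself to select_if().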

In [23]:
# View help on select_if()
? select_if

# select_all {dplyr}	R Documentation
Select and rename a selection of variables
## Description
These scoped variants of select() and rename() operate on a selection of variables. The semantics of these verbs have simple but important differences:

Selection drops variables that are not in the selection while renaming retains them.

The renaming function is optional for selection but not for renaming.

## Usage
select_all(.tbl, .funs = list(), ...)

rename_all(.tbl, .funs = list(), ...)

select_if(.tbl, .predicate, .funs = list(), ...)

rename_if(.tbl, .predicate, .funs = list(), ...)

select_at(.tbl, .vars, .funs = list(), ...)

rename_at(.tbl, .vars, .funs = list(), ...)

In [24]:
# Select all character columns
#Hint: select_if(), is.character()
mpg %>% 
   select_if(is.character) %>% 
   names()

In [25]:
# Select all integer columns
#Hint: select_if(), is.integer()
mpg %>% 
   select_if( is.integer) %>% 
   names()

In [28]:
# Select all double columns
#Hint: select_if(), is.double()
mpg %>% 
   select_if(is.double) %>% 
   names()

In [30]:
# Select all numeric columns
#Hint: select_if(), is.numeric()
mpg %>% 
   select_if( is.numeric ) %>% 
   names()

In [32]:
# Select all logical columns
#Hint: select_if(), is.logical()
mpg %>% 
   select_if(is.logical) %>% 
   names()

Looks like there were no logical columns in this dataframe.

# Code Summary
Let's combine several of these functions together.

In [39]:
# displ as the first column renamed to displacement
# followed by all columns containing "r" except for year
# include all columns that end with "y"
# include all columns that start with the letter "f"

mpg %>% 
   select( displacement = displ, contains("r"), ends_with("y"), starts_with("f"), -year) %>% 
   names()

# Replicate the same criteria as above and additionally
# filter to only return numeric columns
mpg %>% 
   select(displacement = displ, contains("r"), ends_with("y"), starts_with("f"), -year) %>% 
   select_if( is.numeric ) %>% 
   names()


The output should be: 

'displacement' 'manufacturer' 'trans' 'drv' 'cty' 'hwy' 'fl'

'displacement' 'cty' 'hwy'
