
# Lecture 1:  Data transformation

<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* Learn [how to manipulate data](#Data-manipulation), including:
    * [Filtering data](#Row-operation-#1:-Filtering-data)
    * [Arranging (sorting) rows](#Row-operation-#2:-Sorting)
    * [Finding distinct rows](#Row-operation-#3:-distinct())
    * [Selecting subsets of columns](#Column-operation-#1:-select())
    
This lecture note corresponds to [Chapter 4](https://r4ds.hadley.nz/data-transform.html) of your book, and we will also use some ideas from Chapters [14](https://r4ds.hadley.nz/logicals.html)-[15](https://r4ds.hadley.nz/numbers.html).
</div>


## Questions that arose on Piazza
Reminder to please post your questions on Piazza. You will get a faster response, and everyone can see the answers.
- "What's the difference between Github, Colab, and Jupyter?"
- "How is credit assigned for Perusall?"
- "How are the iClicker questions graded?"

## Data manipulation
Manipulating data is an important part of data science--perhaps the most important! As a data scientist, most of your time will be spent simply getting your data into a format that you can analyze:
![data manipulation plot](https://www.datanami.com/wp-content/uploads/2020/07/Anaconda_1.png)
https://www.datanami.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/

There are a lot of built-in commands for data manipulation in R:
```{r}
# Traditional data manipulation commands in R
subset()
aggregate()
merge()
reshape()
```
These commands are old and somewhat difficult to use. Instead of the traditional commands, we are going to focus on the `dplyr` package. Its functions provide a nice suite of replacements for the traditional commands, with a consistent, unified interface, and they interoperate nicely with each other.

The `dplyr` package is part of `tidyverse`, so it is already loaded once we run `library(tidyverse)` (which, as we saw last lecture, we should do each time we start R).

NOTE: We will be using the tools that come with the `dplyr` package instead of the traditional tools that come with R.

We will be using the `nycflights13` data set for this lecture. It does not come with tidyverse. If you are running Jupyter on your own computer you will first need to `install.packages("nycflights13")`.  This data set is about flights departing from the NYC area in 2013.  You have worked with part of this data set in Homework 2.

NOTE: Since Colab runs on a fresh virtual machine each session, we need to run `install.packages("packagename")` every single time. Get into the habit of doing this.

In [1]:
install.packages("nycflights13")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [2]:
# install.packages('nycflights13')
library(tidyverse)
library(nycflights13)

“running command 'timedatectl' had status 1”
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [3]:
print(flights)

# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554       

A tibble is similar to a data frame, and we will learn more about it later in the course. For now, you can treat it as a data frame.

Notice the types of the variables in `flights`. They include:

* **int** integers
* **dbl** double precision floating point numbers
* **chr** character vectors, or strings
* **dttm** date-time (a date along with a time)

Other types available in R but not represented above include:

* **lgl** logical (either `TRUE` or `FALSE`)
* **fct** factor (categorical variable with a fixed number of possible values)
* **date** date
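
To see these types on a small example, the sketch below builds a toy data frame in base R and asks for the class of each column. The column names and values are invented for illustration and are not taken from `flights`; the labels in the comments are the abbreviations a tibble would print.

```r
# Toy data frame with one column per type discussed above (all values invented)
toy <- data.frame(
  n    = c(1L, 2L, 3L),                                  # int
  x    = c(0.5, 1.5, 2.5),                               # dbl
  s    = c("a", "b", "c"),                               # chr
  flag = c(TRUE, FALSE, NA),                             # lgl
  grp  = factor(c("lo", "hi", "lo")),                    # fct
  d    = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03")),  # date
  t    = as.POSIXct(c("2013-01-01 05:00:00",
                      "2013-01-01 06:00:00",
                      "2013-01-01 07:00:00"), tz = "UTC"),      # dttm
  stringsAsFactors = FALSE
)

# First class of each column, as a named character vector
col_types <- sapply(toy, function(col) class(col)[1])
print(col_types)
```

Base R reports the underlying class names (`integer`, `numeric`, `POSIXct`, and so on) rather than the short tibble abbreviations.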

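NOTE: A tibble is similar to a data frame, with some refinements (for example, it prints more compactly). `flights` has about 337,000 rows and 19 columns. Not all columns fit in the printout; the names of the 9 variables that are not shown are listed below the printed rows. Types of variables: int = integers, chr = characters, dbl = double precision floating point numbers, dttm = date-time, date = a date without a time.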


## What's a data frame?

Our main goal in R is to work with data, and one of the most fundamental objects in R is the *data frame*. Think of a data frame as a container for a bunch of *vectors* of data:

![dataframe](https://garrettgman.github.io/images/tidy-2.png)

Note: vectors snap together like Lego bricks to make a table, and a data frame is that table.
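
As a minimal sketch of that idea, the following builds a data frame from three vectors and then pulls one of them back out. The names `name`, `age`, `member`, and `people` are made up for this example.

```r
# Three vectors of equal length (values invented for illustration)
name   <- c("Ann", "Bob", "Cy")
age    <- c(31, 25, 40)
member <- c(TRUE, FALSE, TRUE)

# A data frame holds the vectors side by side as columns
people <- data.frame(name, age, member)
print(people)

# Each column can be pulled back out as an ordinary vector
print(people$age)

# Number of rows, then number of columns
print(dim(people))
```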

## Row operation #1: Filtering data
The first operation we'll learn about is filtering. Filtering means "keep only the rows whose columns match these criteria". The syntax for the `filter` command is 
```{r}
filter(<TABLE>, <LOGICAL CRITERIA>)
```
This command returns a new tibble whose rows all match the specified criteria.

NOTE: the logical criteria are expressions that evaluate to `TRUE` or `FALSE` for each row.
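
Here is a small sketch of the syntax on a toy table, so the result is easy to check by eye. The table `pets` and its values are invented for illustration.

```r
library(dplyr)

# Toy table (values invented for illustration)
pets <- tibble(
  name   = c("Rex", "Tom", "Ivy", "Bo"),
  weight = c(30, 4, 5, 12)
)

# Keep only the rows where the condition is TRUE
heavy <- filter(pets, weight > 10)
print(heavy)

# Several criteria separated by commas must all hold for a row to be kept
mid <- filter(pets, weight > 4, weight < 20)
print(mid)
```

Note that `pets` itself is not modified; `filter()` returns a new table each time.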

### Types of logical criteria
For those who are new to programming, we now briefly review [boolean logic](https://en.wikipedia.org/wiki/Boolean_algebra). The basic logical operators in R are `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal). The first four are used for comparing numbers and function exactly as in mathematics:
```{r}
> 1 > 1
[1] FALSE
> 1 >= 1
[1] TRUE
> 2.5 < 3
[1] TRUE
> 2.5 <= 3
[1] TRUE
```

### Assignment vs. equality
An extremely common mistake for beginner programmers is to confuse `=` and `==` ("double equals") when writing code. As we have seen,
- `=` is used for
    - assigning a value to a variable, and
    - passing a named parameter into a function. 
- `==` is used for testing equality. 


```{r}
> a = 1  # assigns the number 1 to a
> b = 2  # assigns the number 2 to b
> a == 1 # tests that a equals 1 ... is a equal to 1?
[1] TRUE
> b == 1 # tests that b equals 1 ... is b equal to 1?
[1] FALSE
```

In [None]:
# Example of assignment vs equality for ages
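
One possible version of such an example is sketched below; the variable names and the age value are made up. It shows the three roles side by side: `=` for assignment, `=` for passing a named argument, and `==` for testing equality.

```r
# Assignment: store a (made-up) age in the variable `age`
age = 21

# Equality test: asks a question and returns TRUE or FALSE; `age` is unchanged
same = (age == 21)
print(same)

# Named argument: inside a function call, `=` attaches a value to a parameter
rounded = round(3.14159, digits = 2)
print(rounded)
```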

### Boolean operations
Logical expressions are combined using *boolean operations*. The basic boolean operations are `and`, `or`, and `not`, denoted `&`, `|` and `!` respectively.

In [None]:
! FALSE # NOTE: ! flips the value, so the result is TRUE
FALSE | TRUE # NOTE: "or" is TRUE when at least one side is TRUE, so the whole expression is TRUE
! TRUE
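
Boolean operations are most useful for combining comparisons. The following sketch, with an arbitrary value of `x`, shows each operator applied to the result of a comparison.

```r
x <- 7

# "and": both comparisons must be TRUE
in_range <- (x > 5) & (x < 10)

# "or": at least one comparison must be TRUE
outside <- (x < 5) | (x > 10)

# "not": flips TRUE to FALSE and FALSE to TRUE
not_in_range <- !in_range

print(c(in_range, outside, not_in_range))
```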

There are also doubled versions of `&` and `|` denoted `&&` and `||`. Do not use these yet. We will return to them later in the course when we discuss programming and control flow.

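R abbreviates `TRUE` and `FALSE` as `T` and `F`. Unlike `TRUE` and `FALSE`, these are ordinary variables that can be overwritten, so the full names are safer in your own code: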

In [None]:
c(T, F)

### Vectorization
What happens when we ask whether a *vector* satisfies a logical condition? R returns a new logical vector with the same number of entries, showing whether each entry satisfies that condition:

In [6]:
# Examples of logical conditions for class age
a = c(1:10)
print(a)
a > 3 # NOTE: we get a vector of booleans; R checks each element of the vector to see whether or not it is > 3
a[a > 3] # NOTE: keeps only the values of the vector that are > 3

 [1]  1  2  3  4  5  6  7  8  9 10
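
Because `TRUE` counts as 1 and `FALSE` as 0 in arithmetic, a logical vector can also be summed or averaged. A short sketch using the same vector `a`:

```r
a <- 1:10

# One TRUE/FALSE per element of a
above_3 <- a > 3

# Number of elements that satisfy the condition
n_above <- sum(above_3)

# Fraction of elements that satisfy the condition
share_above <- mean(above_3)

print(n_above)
print(share_above)
```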


### Testing for membership
Another useful operator is `%in%`:
```r
x %in% y
```

returns `TRUE` if the value `x` is found in the vector `y`:

In [None]:
"a" %in% c(1, 2, 3) # NOTE: is "a" in the vector 1, 2, 3? No, because the character "a" is not one of its entries.
("a" == 1) | ("a" == 2) | ("a" == 3) # NOTE: is "a" equal to 1, or equal to 2, or equal to 3?
                                      # this is the long version of the line above
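
`%in%` is also vectorized over its left-hand side: if `x` is a vector, you get one `TRUE`/`FALSE` per entry of `x`. A small sketch with made-up month numbers:

```r
# Made-up month numbers
months <- c(1, 5, 10, 12, 11)

# For each entry of months: is it one of 10, 11, 12?
is_q4 <- months %in% c(10, 11, 12)
print(is_q4)

# Use the logical vector to keep only the matching entries
q4_months <- months[is_q4]
print(q4_months)
```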

## Missing data

NOTE: `NA` does not mean that nothing is there. A value exists, but we do not know what it is. That is why adding "I don't know" to 5 gives "I don't know".

Something you will often encounter when working with real data is missing observations. R has a special value, `NA`, for representing missing data. You can think of the value of `NA` as "I don't know". Thus, logical and mathematical operations involving `NA` will generally return `NA` again, so that `NA`s "propagate through" the computation:

In [None]:
NA + 5 # 5 + I-don't-know = I-don't-know

In [None]:
1 < NA  # Is 1 less than I-don't-know? I don't know.

In [7]:
NA < NA  # Is I-don't-know less than I-don't-know? I don't know.
mean(c(1, NA, 3), na.rm = FALSE) # And so forth. NOTE: na.rm = FALSE is the default; the NA is kept, so the mean is NA itself
mean(c(1, NA, 3), na.rm = TRUE)  # NOTE: removes the NA first, so the mean is an actual number

In [None]:
NA == 1 # NOTE: is na equal to 1 

Since you cannot test `NA`s for equality, R has a special function for determining whether a value is `NA`:

In [None]:
5 == NA

is.na(NA) # NOTE: is this value an NA?
is.na(1) # NOTE: is 1 an NA?

In [8]:
# Examples of missing values with the ages vector
ages=c(10,20,30,NA)
is.na(ages) # NOTE: is each element of the vector an NA?
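
Building on the `ages` vector, the sketch below counts the missing entries, drops them, and shows two cases where an operation involving `NA` does not return `NA`, because the answer is the same whatever the unknown value is.

```r
ages <- c(10, 20, 30, NA)

# Count the missing entries (TRUE counts as 1)
n_missing <- sum(is.na(ages))

# Keep only the entries that are not missing
known <- ages[!is.na(ages)]
mean_known <- mean(known)

# "unknown or TRUE" is TRUE no matter what the unknown value is
na_or_true <- NA | TRUE

# "unknown and FALSE" is FALSE no matter what the unknown value is
na_and_false <- NA & FALSE

print(n_missing)
print(known)
print(mean_known)
```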

### Examples of filtering
Let's use what we have just learned to evaluate some simple queries on the `flights` dataset. Let's first narrow down to all flights that departed on December 31:

In [10]:
# Filter to all flights on 12/31
filter(flights, month == 12, day == 31) # NOTE: filter(data frame, logical condition 1, logical condition 2)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,12,31,13,2359,14,439,437,2,B6,839,N566JB,JFK,BQN,189,1576,23,59,2013-12-31 23:00:00
2013,12,31,18,2359,19,449,444,5,DL,412,N713TW,JFK,SJU,192,1598,23,59,2013-12-31 23:00:00
2013,12,31,26,2245,101,129,2353,96,B6,108,N374JB,JFK,PWM,50,273,22,45,2013-12-31 22:00:00
2013,12,31,459,500,-1,655,651,4,US,1895,N557UW,EWR,CLT,95,529,5,0,2013-12-31 05:00:00
2013,12,31,514,515,-1,814,812,2,UA,700,N470UA,EWR,IAH,223,1400,5,15,2013-12-31 05:00:00
2013,12,31,549,551,-2,925,900,25,UA,274,N577UA,EWR,LAX,346,2454,5,51,2013-12-31 05:00:00
2013,12,31,550,600,-10,725,745,-20,AA,301,N3CXAA,LGA,ORD,127,733,6,0,2013-12-31 06:00:00
2013,12,31,552,600,-8,811,826,-15,EV,3825,N14916,EWR,IND,118,645,6,0,2013-12-31 06:00:00
2013,12,31,553,600,-7,741,754,-13,DL,731,N333NB,LGA,DTW,86,502,6,0,2013-12-31 06:00:00
2013,12,31,554,550,4,1024,1027,-3,B6,939,N552JB,JFK,BQN,195,1576,5,50,2013-12-31 05:00:00


How does all this work? R creates a logical vector that has one entry for each **row** of the input data frame. It then returns a new data frame containing all the rows where the logical vector evaluated to `TRUE`.
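
The sketch below makes that logical vector visible on a toy table (the table `trips` and its values are invented). It builds the vector by hand, uses it to index the rows, and compares the result with `filter()`.

```r
library(dplyr)

# Toy table (values invented for illustration)
trips <- tibble(
  month = c(12, 12, 1, 12),
  day   = c(31, 30, 31, 31),
  id    = c("a", "b", "c", "d")
)

# One TRUE/FALSE per row: is this row on December 31?
keep <- trips$month == 12 & trips$day == 31
print(keep)

# Keeping the TRUE rows by hand ...
by_hand <- trips[keep, ]

# ... gives the same rows as filter()
by_filter <- filter(trips, month == 12, day == 31)
print(by_filter)
```

One difference worth knowing: `filter()` also drops rows for which the condition evaluates to `NA`.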

## Remark:
Using `==` to test equality is very important in `R`. `filter()` will yield an error if you use `=` instead.

In [11]:
# what happens when you forget to use == in filter?
filter(flights,month=1)

ERROR: ignored

## 🤔 Quiz
How many flights in the table departed in October, November, or December?

<ol style="list-style-type: upper-alpha;">
    <li>Fewer than 50,000</li>
    <li>Between 50,000 and 60,000</li>
    <li>Between 60,000 and 70,000</li>
    <li>Between 70,000 and 80,000</li>
    <li>More than 80,000</li>
</ol>

In [17]:
# Code to get number of flights that departed in Oct-Dec

“longer object length is not a multiple of shorter object length”


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,10,1,536,545,-9,809,855,-46,AA,2243,N630AA,JFK,MIA,132,1089,5,45,2013-10-01 05:00:00
2013,10,1,544,550,-6,912,932,-20,B6,939,N593JB,JFK,BQN,191,1576,5,50,2013-10-01 05:00:00
2013,10,1,550,600,-10,649,659,-10,US,2167,N749US,LGA,DCA,39,214,6,0,2013-10-01 06:00:00
2013,10,1,551,600,-9,655,708,-13,B6,2180,N238JB,EWR,BOS,40,200,6,0,2013-10-01 06:00:00
2013,10,1,553,600,-7,811,829,-18,DL,563,N961DL,LGA,ATL,105,762,6,0,2013-10-01 06:00:00
2013,10,1,555,600,-5,810,840,-30,B6,27,N580JB,EWR,MCO,117,937,6,0,2013-10-01 06:00:00
2013,10,1,557,600,-3,851,923,-32,UA,303,N590UA,JFK,SFO,331,2586,6,0,2013-10-01 06:00:00
2013,10,1,558,600,-2,757,815,-18,FL,345,N338AT,LGA,ATL,98,762,6,0,2013-10-01 06:00:00
2013,10,1,559,600,-1,719,730,-11,AA,301,N466AA,LGA,ORD,115,733,6,0,2013-10-01 06:00:00
2013,10,1,604,610,-6,751,813,-22,DL,1919,N986DL,LGA,MSP,146,1020,6,10,2013-10-01 06:00:00




### Counting matches
Sometimes we just want to know how many observations match a given filter. The `nrow()` command can be used to count the number of rows in a data table.

Let us calculate how many flights in our data have a missing departure time.

In [21]:
# How many flights have a missing departure time
nrow(filter(flights,is.na(dep_time)))

How about the number of flights departing between Jan and Mar?

In [22]:
# no. of flights departing between Jan and Mar
filter(flights, month %in% 1:3)
nrow(filter(flights, month %in% 1:3))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,709,723,-14,EV,5708,N829AS,LGA,IAD,53,229,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,838,846,-8,B6,79,N593JB,JFK,MCO,140,944,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00




### Row operation #1.5: `head()`/`tail()`
A particularly useful kind of `filter()` is to only keep the first, or last, rows. This happens so often in practical data science, that there are built-in commands to do it:

In [23]:
# first 6 rows of the January flights
head(filter(flights, month == 1))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00


In [28]:
# last 6 rows of the December flights
tail(filter(flights, month == 12))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,12,31,,855,,,1142,,UA,1506,,EWR,JAC,,1874,8,55,2013-12-31 08:00:00
2013,12,31,,705,,,931,,UA,1729,,EWR,DEN,,1605,7,5,2013-12-31 07:00:00
2013,12,31,,825,,,1029,,US,1831,,JFK,CLT,,541,8,25,2013-12-31 08:00:00
2013,12,31,,1615,,,1800,,MQ,3301,N844MQ,LGA,RDU,,431,16,15,2013-12-31 16:00:00
2013,12,31,,600,,,735,,UA,219,,EWR,ORD,,719,6,0,2013-12-31 06:00:00
2013,12,31,,830,,,1154,,UA,443,,JFK,LAX,,2475,8,30,2013-12-31 08:00:00
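
Both commands return 6 rows by default and accept an `n` argument to change that. A short sketch on a made-up table of 10 rows:

```r
# Toy table with 10 rows (values invented for illustration)
scores <- data.frame(id = 1:10, score = seq(50, 95, by = 5))

# First 3 rows instead of the default 6
first_three <- head(scores, n = 3)
print(first_three)

# Last 2 rows
last_two <- tail(scores, n = 2)
print(last_two)
```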


## Row operation #2: Sorting
Often we want to sort our data based on one or more column values. This can arise for several reasons:
- The data have some natural order (for example, chronological)
- We want to learn about "extreme" features of the data:
  - "What was the most delayed flight?"
  - "What was the hottest day of the year?"
  - "Who was the tallest NBA player in 2010?"

`arrange()` changes the order of the rows based on the value of the columns. 
- It takes a data frame and a set of column names (or more complicated expressions) to order by. 
- If you provide multiple expressions to order by, it uses the second one to break ties in the first one, third one to break ties in the second one, and so on.
- By default, things are sorted in **ascending** order.
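
These rules are easiest to see on a table small enough to sort in your head. In the sketch below, the table `runs` and its values are invented for illustration.

```r
library(dplyr)

# Toy table (values invented for illustration)
runs <- tibble(
  runner = c("A", "B", "C", "D"),
  month  = c(2, 1, 2, 1),
  time   = c(30, 28, 25, 27)
)

# Sort by month first; within the same month, sort by time
by_month_time <- arrange(runs, month, time)
print(by_month_time)

# desc() reverses the order for that column: largest time first
slowest_first <- arrange(runs, desc(time))
print(slowest_first)
```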

In [31]:
# sort flights in ascending order by month and day
arrange(flights, month, day) # NOTE: sort by month ascending, breaking ties by day

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,709,723,-14,EV,5708,N829AS,LGA,IAD,53,229,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,838,846,-8,B6,79,N593JB,JFK,MCO,140,944,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,12,1,13,2359,14,446,445,1,B6,745,N715JB,JFK,PSE,195,1617,23,59,2013-12-01 23:00:00
2013,12,1,17,2359,18,443,437,6,B6,839,N593JB,JFK,BQN,186,1576,23,59,2013-12-01 23:00:00
2013,12,1,453,500,-7,636,651,-15,US,1895,N197UW,EWR,CLT,86,529,5,0,2013-12-01 05:00:00
2013,12,1,520,515,5,749,808,-19,UA,1487,N69804,EWR,IAH,193,1400,5,15,2013-12-01 05:00:00
2013,12,1,536,540,-4,845,850,-5,AA,2243,N634AA,JFK,MIA,144,1089,5,40,2013-12-01 05:00:00
2013,12,1,540,550,-10,1005,1027,-22,B6,939,N821JB,JFK,BQN,189,1576,5,50,2013-12-01 05:00:00
2013,12,1,541,545,-4,734,755,-21,EV,3819,N13968,EWR,CVG,95,569,5,45,2013-12-01 05:00:00
2013,12,1,546,545,1,826,835,-9,UA,1441,N23708,LGA,IAH,204,1416,5,45,2013-12-01 05:00:00
2013,12,1,549,600,-11,648,659,-11,US,2167,N945UW,LGA,DCA,42,214,6,0,2013-12-01 06:00:00
2013,12,1,550,600,-10,825,854,-29,B6,605,N706JB,EWR,FLL,140,1065,6,0,2013-12-01 06:00:00


In [32]:
# flight that got delayed the most 
arrange(flights,desc(dep_delay))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,9,641,900,1301,1242,1530,1272,HA,51,N384HA,JFK,HNL,640,4983,9,0,2013-01-09 09:00:00
2013,6,15,1432,1935,1137,1607,2120,1127,MQ,3535,N504MQ,JFK,CMH,74,483,19,35,2013-06-15 19:00:00
2013,1,10,1121,1635,1126,1239,1810,1109,MQ,3695,N517MQ,EWR,ORD,111,719,16,35,2013-01-10 16:00:00
2013,9,20,1139,1845,1014,1457,2210,1007,AA,177,N338AA,JFK,SFO,354,2586,18,45,2013-09-20 18:00:00
2013,7,22,845,1600,1005,1044,1815,989,MQ,3075,N665MQ,JFK,CVG,96,589,16,0,2013-07-22 16:00:00
2013,4,10,1100,1900,960,1342,2211,931,DL,2391,N959DL,JFK,TPA,139,1005,19,0,2013-04-10 19:00:00
2013,3,17,2321,810,911,135,1020,915,DL,2119,N927DA,LGA,MSP,167,1020,8,10,2013-03-17 08:00:00
2013,6,27,959,1900,899,1236,2226,850,DL,2007,N3762Y,JFK,PDX,313,2454,19,0,2013-06-27 19:00:00
2013,7,22,2257,759,898,121,1026,895,DL,2047,N6716C,LGA,ATL,109,762,7,59,2013-07-22 07:00:00
2013,12,5,756,1700,896,1058,2020,878,AA,172,N5DMAA,EWR,MIA,149,1085,17,0,2013-12-05 17:00:00


When we sort the data by month and day, the top-most rows have the earliest month, with ties broken by day.

If you want to sort in **descending** order, write `desc(<column>)` instead:

In [None]:
# flights in descending order of month

In [None]:
arrange(flights,desc(month)) # NOTE: sort by month descending

## 🤔 Quiz
What's the farthest distance traveled by any flight in the dataset?

<ol style="list-style-type: upper-alpha;">
    <li>1200 meters</li>
    <li>2783 miles</li>
    <li>4143 miles</li>
    <li>4983 miles</li>
    <li>4983 km</li>
</ol>

In [34]:
# solution
help(flights) # NOTE: distance is in miles 
arrange(flights,desc(distance))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,857,900,-3,1516,1530,-14,HA,51,N380HA,JFK,HNL,659,4983,9,0,2013-01-01 09:00:00
2013,1,2,909,900,9,1525,1530,-5,HA,51,N380HA,JFK,HNL,638,4983,9,0,2013-01-02 09:00:00
2013,1,3,914,900,14,1504,1530,-26,HA,51,N380HA,JFK,HNL,616,4983,9,0,2013-01-03 09:00:00
2013,1,4,900,900,0,1516,1530,-14,HA,51,N384HA,JFK,HNL,639,4983,9,0,2013-01-04 09:00:00
2013,1,5,858,900,-2,1519,1530,-11,HA,51,N381HA,JFK,HNL,635,4983,9,0,2013-01-05 09:00:00
2013,1,6,1019,900,79,1558,1530,28,HA,51,N385HA,JFK,HNL,611,4983,9,0,2013-01-06 09:00:00
2013,1,7,1042,900,102,1620,1530,50,HA,51,N385HA,JFK,HNL,612,4983,9,0,2013-01-07 09:00:00
2013,1,8,901,900,1,1504,1530,-26,HA,51,N389HA,JFK,HNL,645,4983,9,0,2013-01-08 09:00:00
2013,1,9,641,900,1301,1242,1530,1272,HA,51,N384HA,JFK,HNL,640,4983,9,0,2013-01-09 09:00:00
2013,1,10,859,900,-1,1449,1530,-41,HA,51,N388HA,JFK,HNL,633,4983,9,0,2013-01-10 09:00:00


## Row operation #3: `distinct()`
`distinct()` finds all the unique rows in a dataset. If you supply column names, it will find all unique combinations of those columns.

Here is an example: the number of rows in the dataset is 

In [None]:
nrow(flights)

In [35]:
nrow(distinct(flights)) # NOTE: if this equals nrow(flights), no two records are exactly the same (no duplicates)

Obviously there are many flights scheduled for each day. But is there a flight scheduled for *every* day?

In [37]:
# All the unique days in flights
nrow(distinct(flights, day)) # NOTE: day is the day of the month, so this counts day-of-month values; distinct(flights, month, day) would count calendar dates
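
The difference between one column and a combination of columns is sketched below on a toy table (values invented). The same pattern, applied to `flights` with `month` and `day`, would count the calendar dates that have at least one flight.

```r
library(dplyr)

# Toy table: five (made-up) flights on three different dates
toy <- tibble(
  month = c(1, 1, 1, 2, 2),
  day   = c(1, 1, 2, 1, 1)
)

# Distinct values of day alone: day 1 in January and day 1 in February collapse together
n_days_of_month <- nrow(distinct(toy, day))

# Distinct (month, day) combinations: one row per calendar date
n_dates <- nrow(distinct(toy, month, day))

print(n_days_of_month)
print(n_dates)
```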

## 🤔 Quiz
How many unique *airplanes* are represented in the `flights` table?

<ol style="list-style-type: upper-alpha;">
    <li>3,030</li>
    <li>4,043</li>
    <li>4,044</li>
    <li>90,210</li>
    <li>Can't be determined from the table</li>
</ol>

(Hint: every airplane has a unique tail number.)

In [38]:
# analysis of unique airplanes
nrow(distinct(flights, tailnum)) # NOTE: distinct() counts NA as a value of its own, so check whether tailnum has missing entries
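
When counting distinct values, keep in mind that `distinct()` treats `NA` as a value of its own. The sketch below shows the effect on a toy column of made-up tail numbers, and one way to exclude the missing entries before counting.

```r
library(dplyr)

# Toy table: made-up tail numbers, two of them missing
planes <- tibble(tailnum = c("N1", "N2", NA, "N1", NA))

# NA is kept as one of the distinct values
n_with_na <- nrow(distinct(planes, tailnum))

# Drop the missing tail numbers first, then count
n_known <- nrow(distinct(filter(planes, !is.na(tailnum)), tailnum))

print(n_with_na)
print(n_known)
```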

## Column operation #1: `select()`
`select()` can help you narrow down a large dataset by focusing on just the variables you're interested in. It lets you rapidly zoom in on a useful subset using operations based on the names of the variables:

In [None]:
tbl <- select(flights, year, month, day, departure_time = dep_time, arrival_time = arr_time)
head(tbl)

year,month,day,departure_time,arrival_time
<int>,<int>,<int>,<int>,<int>
2013,1,1,517,830
2013,1,1,533,850
2013,1,1,542,923
2013,1,1,544,1004
2013,1,1,554,812
2013,1,1,554,740


In [40]:
select(flights, tailnum, month) # NOTE: keeps only the tailnum (airplane tail number) and month columns

tailnum,month
<chr>,<int>
N14228,1
N24211,1
N619AA,1
N804JB,1
N668DN,1
N39463,1
N516JB,1
N829AS,1
N593JB,1
N3ALAA,1


Note that `select` drops any variables not explicitly mentioned. To just rename some variables while keeping all others, use `rename`.

In [None]:
head(rename(flights, departure_time = dep_time, arrival_time = arr_time))

year,month,day,departure_time,sched_dep_time,dep_delay,arrival_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00


If there are a lot of variables, you can save yourself some typing by using `:` and `-` in combination with select. The colon operator selects a range of variables:

In [None]:
head(select(flights, year:day)) # NOTE: includes all columns from year through day

year,month,day
<int>,<int>,<int>
2013,1,1
2013,1,1
2013,1,1
2013,1,1
2013,1,1
2013,1,1


The negative sign lets you select everything but certain columns:

In [41]:
# select everything except the month column
select(flights, -month) # NOTE: you want all columns except month

year,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,2013-01-01 06:00:00
2013,1,557,600,-3,709,723,-14,EV,5708,N829AS,LGA,IAD,53,229,6,0,2013-01-01 06:00:00
2013,1,557,600,-3,838,846,-8,B6,79,N593JB,JFK,MCO,140,944,6,0,2013-01-01 06:00:00
2013,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00


If you want to bring a few variables to the beginning, you can use `everything()` to refer to the remaining variables.

In [43]:
# bring dep_time and month to the front
head(select(flights, dep_time, month, everything())) # NOTE: dep_time comes first, then month, then everything else; head() keeps only the first 6 rows

dep_time,month,year,day,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
517,1,2013,1,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
533,1,2013,1,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
542,1,2013,1,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
544,1,2013,1,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
554,1,2013,1,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
554,1,2013,1,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00


In addition, there are some helper functions that only work inside `select()`.

* `starts_with()`, `ends_with()`, `contains()`
* `matches()`
* `num_range()`

You can consult the documentation or type `?select` at the prompt to learn more about these. Here's just one example of their use.

In [44]:
# select all the variables that contain the word "time"
head(select(flights, contains('time'))) # NOTE: show all the columns with "time" in their name, but only the first 6 rows

dep_time,sched_dep_time,arr_time,sched_arr_time,air_time,time_hour
<int>,<int>,<int>,<int>,<dbl>,<dttm>
517,515,830,819,227,2013-01-01 05:00:00
533,529,850,830,227,2013-01-01 05:00:00
542,540,923,850,160,2013-01-01 05:00:00
544,545,1004,1022,183,2013-01-01 05:00:00
554,600,812,837,116,2013-01-01 06:00:00
554,558,740,728,150,2013-01-01 05:00:00
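
The other helpers follow the same pattern. The sketch below applies `starts_with()` and `ends_with()`, plus the negative sign with several columns at once, to a one-row toy table whose column names mimic a few of those in `flights`. The table `toy` itself is made up for this example.

```r
library(dplyr)

# One-row toy table with flights-style column names
toy <- tibble(
  dep_time = 517, dep_delay = 2, arr_time = 830, arr_delay = 11, carrier = "UA"
)

# Columns whose names begin with "dep"
dep_cols <- names(select(toy, starts_with("dep")))

# Columns whose names end with "delay"
delay_cols <- names(select(toy, ends_with("delay")))

# Everything except the two listed columns
rest_cols <- names(select(toy, -c(dep_time, arr_time)))

print(dep_cols)
print(delay_cols)
print(rest_cols)
```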
