# 3. Data transformation

Notes and exercises based on the [Data transformation chapter](https://r4ds.hadley.nz/data-transform) of *R for Data Science* (2e).

In [1]:
library(nycflights13)
library(tidyverse)


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Exploring the data

In [3]:
glimpse(flights)


Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6",

In [4]:
head(flights)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00


## Intro to `dplyr`

In [6]:
flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  ) |>
  head()


`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.


year,month,day,arr_delay
<int>,<int>,<int>,<dbl>
2013,1,1,17.85
2013,1,2,7.0
2013,1,3,18.315789
2013,1,4,-3.2
2013,1,5,20.230769
2013,1,6,9.277778
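
The `.groups` message above appears because `summarize()` drops only the last grouping level by default. As a minimal sketch, assuming an ungrouped result is what is wanted, the same summary can set `.groups = "drop"` explicitly. The name `daily_iah_delay` is only illustrative.

```r
library(nycflights13)
library(dplyr)

# Same summary as the cell above, but all grouping is dropped from the
# result, which also silences the `.groups` message
daily_iah_delay <- flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    .groups = "drop"
  )

head(daily_iah_delay)
```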


In [9]:
flights |>
  filter(dep_delay > 120) |>
  head(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,848,1835,853,1001,1950,851,MQ,3944,N942MQ,JFK,BWI,41,184,18,35,2013-01-01 18:00:00
2013,1,1,957,733,144,1056,853,123,UA,856,N534UA,EWR,BOS,37,200,7,33,2013-01-01 07:00:00
2013,1,1,1114,900,134,1447,1222,145,UA,1086,N76502,LGA,IAH,248,1416,9,0,2013-01-01 09:00:00
2013,1,1,1540,1338,122,2020,1825,115,B6,705,N570JB,JFK,SJU,193,1598,13,38,2013-01-01 13:00:00
2013,1,1,1815,1325,290,2120,1542,338,EV,4417,N17185,EWR,OMA,213,1134,13,25,2013-01-01 13:00:00
2013,1,1,1842,1422,260,1958,1535,263,EV,4633,N18120,EWR,BTV,46,266,14,22,2013-01-01 14:00:00
2013,1,1,1856,1645,131,2212,2005,127,AA,181,N323AA,JFK,LAX,336,2475,16,45,2013-01-01 16:00:00
2013,1,1,1934,1725,129,2126,1855,151,MQ,4255,N909MQ,JFK,BNA,154,765,17,25,2013-01-01 17:00:00
2013,1,1,1938,1703,155,2109,1823,166,EV,4300,N18557,EWR,RIC,68,277,17,3,2013-01-01 17:00:00
2013,1,1,1942,1705,157,2124,1830,174,MQ,4410,N835MQ,JFK,DCA,60,213,17,5,2013-01-01 17:00:00


In [21]:
# Flights that departed on January 1
flights |>
  filter(month == 1 & day == 1) |>
  sample_n(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,1318,1322,-4,1358,1416,-18,EV,4106,N19554,EWR,BDL,25,116,13,22,2013-01-01 13:00:00
2013,1,1,805,800,5,1118,1106,12,B6,3,N570JB,JFK,FLL,165,1069,8,0,2013-01-01 08:00:00
2013,1,1,857,905,-8,1107,1120,-13,DL,181,N321NB,LGA,DTW,110,502,9,5,2013-01-01 09:00:00
2013,1,1,1251,1252,-1,1611,1555,16,B6,85,N657JB,JFK,FLL,173,1069,12,52,2013-01-01 12:00:00
2013,1,1,629,630,-1,824,810,14,AA,303,N3CYAA,LGA,ORD,140,733,6,30,2013-01-01 06:00:00
2013,1,1,1515,1437,38,1834,1742,52,B6,347,N178JB,JFK,SRQ,171,1041,14,37,2013-01-01 14:00:00
2013,1,1,1005,1000,5,1239,1234,5,UA,1625,N81449,EWR,DEN,254,1605,10,0,2013-01-01 10:00:00
2013,1,1,1720,1725,-5,2121,2105,16,DL,513,N723TW,JFK,LAX,363,2475,17,25,2013-01-01 17:00:00
2013,1,1,1456,1500,-4,1649,1632,17,UA,685,N802UA,LGA,ORD,140,733,15,0,2013-01-01 15:00:00
2013,1,1,1422,1410,12,1613,1555,18,MQ,4491,N737MQ,LGA,CLE,93,419,14,10,2013-01-01 14:00:00


In [20]:
# Flights that departed in January or February
flights |>
  filter(month == 1 | month == 2) |>
  sample_n(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,16,2008,2005,3,2254,2258,-4,UA,405,N411UA,EWR,MCO,141,937,20,5,2013-01-16 20:00:00
2013,1,28,900,905,-5,1125,1115,10,MQ,4478,N739MQ,LGA,DTW,109,502,9,5,2013-01-28 09:00:00
2013,2,16,1758,1745,13,2114,2136,-22,DL,31,N721TW,JFK,SFO,348,2586,17,45,2013-02-16 17:00:00
2013,1,18,1541,1530,11,1755,1734,21,US,1665,N716UW,LGA,CLT,86,544,15,30,2013-01-18 15:00:00
2013,1,24,2348,2359,-11,418,444,-26,B6,739,N605JB,JFK,PSE,193,1617,23,59,2013-01-24 23:00:00
2013,2,18,1600,1604,-4,1718,1739,-21,UA,1053,N76254,EWR,ORD,120,719,16,4,2013-02-18 16:00:00
2013,2,27,2020,1900,80,2144,2018,86,EV,5714,N827AS,JFK,IAD,48,228,19,0,2013-02-27 19:00:00
2013,2,18,2022,2010,12,2325,2321,4,UA,1299,N37255,EWR,RSW,164,1068,20,10,2013-02-18 20:00:00
2013,1,21,601,608,-7,654,725,-31,UA,733,N822UA,EWR,BOS,32,200,6,8,2013-01-21 06:00:00
2013,2,28,1459,1500,-1,1747,1742,5,DL,2347,N6708D,LGA,ATL,112,762,15,0,2013-02-28 15:00:00


In [23]:
# A shorter way to select flights that departed in January or February
flights |>
  filter(month %in% c(1, 2)) |>
  sample_n(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,24,919.0,920,-1.0,1242.0,1233,9.0,UA,1275,N57869,EWR,LAX,332.0,2454,9,20,2013-01-24 09:00:00
2013,1,4,810.0,810,0.0,1029.0,1030,-1.0,FL,346,N899AT,LGA,ATL,115.0,762,8,10,2013-01-04 08:00:00
2013,1,6,1126.0,1130,-4.0,1253.0,1306,-13.0,EV,4431,N16151,EWR,RDU,75.0,416,11,30,2013-01-06 11:00:00
2013,1,22,1105.0,1105,0.0,1241.0,1245,-4.0,WN,542,N443WN,LGA,MDW,121.0,725,11,5,2013-01-22 11:00:00
2013,1,31,1801.0,1800,1.0,1922.0,1913,9.0,US,2185,N748UW,LGA,DCA,47.0,214,18,0,2013-01-31 18:00:00
2013,1,30,1523.0,1345,98.0,1754.0,1641,73.0,B6,1783,N805JB,JFK,MCO,139.0,944,13,45,2013-01-30 13:00:00
2013,1,25,730.0,710,20.0,933.0,850,43.0,MQ,3737,N507MQ,EWR,ORD,129.0,719,7,10,2013-01-25 07:00:00
2013,1,5,537.0,540,-3.0,831.0,850,-19.0,AA,1141,N5DBAA,JFK,MIA,153.0,1089,5,40,2013-01-05 05:00:00
2013,2,26,1556.0,1605,-9.0,1912.0,1911,1.0,B6,157,N794JB,JFK,MCO,157.0,944,16,5,2013-02-26 16:00:00
2013,2,9,,1600,,,1730,,9E,3453,,JFK,BOS,,187,16,0,2013-02-09 16:00:00
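
The cells above use `sample_n()`, which still works but is superseded in current dplyr by `slice_sample()`. A possible equivalent is sketched below; the seed value and the name `winter_sample` are arbitrary choices made here so the random sample can be reproduced.

```r
library(nycflights13)
library(dplyr)

set.seed(42)  # arbitrary seed so the same 10 rows come back on every run

# Flights that departed in January or February, sampled with slice_sample()
winter_sample <- flights |>
  filter(month %in% c(1, 2)) |>
  slice_sample(n = 10)

winter_sample
```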


## Intro to `arrange`

In [25]:
flights |>
  arrange(desc(year), desc(month), desc(day), desc(dep_time)) |>
  head(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,12,31,2356,2359,-3,436,445,-9,B6,745,N665JB,JFK,PSE,200,1617,23,59,2013-12-31 23:00:00
2013,12,31,2355,2359,-4,430,440,-10,B6,1503,N509JB,JFK,SJU,195,1598,23,59,2013-12-31 23:00:00
2013,12,31,2332,2245,47,58,3,55,B6,486,N334JB,JFK,ROC,60,264,22,45,2013-12-31 22:00:00
2013,12,31,2328,2330,-2,412,409,3,B6,1389,N651JB,EWR,SJU,198,1608,23,30,2013-12-31 23:00:00
2013,12,31,2321,2250,31,46,8,38,B6,2002,N179JB,JFK,BUF,66,301,22,50,2013-12-31 22:00:00
2013,12,31,2310,2255,15,7,2356,11,B6,718,N279JB,JFK,BOS,40,187,22,55,2013-12-31 22:00:00
2013,12,31,2245,2250,-5,2359,2356,3,B6,1816,N318JB,JFK,SYR,51,209,22,50,2013-12-31 22:00:00
2013,12,31,2235,2245,-10,2351,2355,-4,B6,234,N355JB,JFK,BTV,49,266,22,45,2013-12-31 22:00:00
2013,12,31,2218,2219,-1,315,304,11,B6,1203,N625JB,JFK,SJU,202,1598,22,19,2013-12-31 22:00:00
2013,12,31,2211,2159,12,100,45,15,B6,1183,N715JB,JFK,MCO,148,944,21,59,2013-12-31 21:00:00


## Intro to `distinct`

In [29]:
# Remove duplicate rows, if any
flights |>
  distinct() |>
  sample_n(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,6,29,1354,1349,5,1459,1504,-5,B6,118,N318JB,JFK,BOS,42,187,13,49,2013-06-29 13:00:00
2013,5,21,1903,1909,-6,2159,2235,-36,B6,87,N633JB,JFK,SLC,265,1990,19,9,2013-05-21 19:00:00
2013,9,19,1830,1830,0,2037,2029,8,DL,548,N316US,EWR,DTW,78,488,18,30,2013-09-19 18:00:00
2013,5,7,1440,1445,-5,1623,1645,-22,US,1445,N193UW,LGA,CLT,78,544,14,45,2013-05-07 14:00:00
2013,10,11,843,850,-7,1001,1003,-2,EV,3810,N14907,EWR,BUF,44,282,8,50,2013-10-11 08:00:00
2013,5,19,2248,2130,78,29,2247,102,EV,4378,N12564,EWR,BTV,60,266,21,30,2013-05-19 21:00:00
2013,6,19,1635,1635,0,1935,1954,-19,B6,15,N821JB,JFK,FLL,148,1069,16,35,2013-06-19 16:00:00
2013,4,21,1449,1455,-6,1630,1632,-2,9E,3318,N907XJ,JFK,BUF,56,301,14,55,2013-04-21 14:00:00
2013,10,17,843,825,18,1131,1120,11,UA,478,N483UA,EWR,MCO,128,937,8,25,2013-10-17 08:00:00
2013,4,6,639,640,-1,936,1000,-24,UA,387,N822UA,EWR,LAX,338,2454,6,40,2013-04-06 06:00:00


In [30]:
# Find all possible destinations
flights |>
  distinct(dest) |>
  head(10)


dest
<chr>
IAH
MIA
BQN
ATL
ORD
FLL
IAD
MCO
PBI
TPA


In [26]:
# Find all unique origin and destination pairs
flights |>
  distinct(origin, dest) |>
  sample_n(10)


origin,dest
<chr>,<chr>
LGA,SAV
EWR,AVL
JFK,TPA
LGA,HOU
EWR,IND
JFK,ABQ
JFK,BOS
JFK,HOU
LGA,CVG
EWR,MDW


In [33]:
# Keep the other columns as well
flights |>
  distinct(origin, dest, .keep_all = TRUE) |>
  head(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,709,723,-14,EV,5708,N829AS,LGA,IAD,53,229,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,838,846,-8,B6,79,N593JB,JFK,MCO,140,944,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00


## Intro to `count`

In [35]:
# Top 10 routes
flights |>
  count(origin, dest, sort = TRUE) |>
  head(10)


origin,dest,n
<chr>,<chr>,<int>
JFK,LAX,11262
LGA,ATL,10263
LGA,ORD,8857
JFK,SFO,8204
LGA,CLT,6168
EWR,ORD,6100
JFK,BOS,5898
LGA,MIA,5781
JFK,MCO,5464
EWR,BOS,5327
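
To make clearer what `count()` does, here is a rough sketch of the same tally written out with `group_by()`, `summarize()`, and `arrange()`. The object names are hypothetical, and rows with tied counts may come back in a different order between the two versions.

```r
library(nycflights13)
library(dplyr)

# Longhand version: group, count rows per group, then sort by the count
top_routes_manual <- flights |>
  group_by(origin, dest) |>
  summarize(n = n(), .groups = "drop") |>
  arrange(desc(n))

# Shorthand used in the cell above
top_routes_count <- flights |>
  count(origin, dest, sort = TRUE)

head(top_routes_manual, 10)
```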


## Exercises

### Exercise 1

In a single pipeline for each condition, find all flights that meet the condition:

- Had an arrival delay of two or more hours
- Flew to Houston (IAH or HOU)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight

In [3]:
head(flights)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00


In [8]:
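# Chained filter() calls are combined with AND, so this keeps only the
# flights that meet the first four conditions at the same time
flights |>
  filter(arr_delay / 60 >= 2) |>
  filter(dest %in% c("IAH", "HOU")) |>
  filter(carrier %in% c("UA", "AA", "DL")) |>
  filter(month %in% c(7, 8, 9)) |>
  sample_n(10)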


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,9,22,2000,1700,180,2232,1955,157,AA,211,N3DDAA,JFK,IAH,184,1417,17,0,2013-09-22 17:00:00
2013,7,10,1629,1520,69,2048,1754,174,UA,646,N578UA,EWR,IAH,187,1400,15,20,2013-07-10 15:00:00
2013,8,9,1850,1730,80,2221,2007,134,UA,268,N524UA,EWR,IAH,182,1400,17,30,2013-08-09 17:00:00
2013,7,22,1524,1249,155,1902,1541,201,UA,215,N430UA,LGA,IAH,187,1416,12,49,2013-07-22 12:00:00
2013,7,22,2018,1735,163,2344,2030,194,AA,1901,N3CAAA,JFK,IAH,196,1417,17,35,2013-07-22 17:00:00
2013,7,9,1937,1735,122,2240,2030,130,AA,1901,N3ARAA,JFK,IAH,174,1417,17,35,2013-07-09 17:00:00
2013,8,22,1956,1735,141,2239,2030,129,AA,1901,N3EEAA,JFK,IAH,182,1417,17,35,2013-08-22 17:00:00
2013,9,2,1833,1520,193,2105,1810,175,UA,498,N497UA,EWR,IAH,184,1400,15,20,2013-09-02 15:00:00
2013,8,16,1652,1259,233,1935,1551,224,UA,643,N428UA,LGA,IAH,198,1416,12,59,2013-08-16 12:00:00
2013,7,25,2147,1520,387,8,1754,374,UA,646,N569UA,EWR,IAH,181,1400,15,20,2013-07-25 15:00:00
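
The exercise asks for one pipeline per condition, so a possible set of separate answers is sketched below. The object names are illustrative, and the last two filters reflect one reading of the wording: "didn't leave late" is taken as `dep_delay <= 0`, and "made up over 30 minutes" as the arrival delay being more than 30 minutes smaller than the departure delay.

```r
library(nycflights13)
library(dplyr)

# Arrival delay of two or more hours (delays are recorded in minutes)
late_arrivals <- flights |> filter(arr_delay >= 120)

# Flew to Houston (IAH or HOU)
to_houston <- flights |> filter(dest %in% c("IAH", "HOU"))

# Operated by United, American, or Delta
big_three <- flights |> filter(carrier %in% c("UA", "AA", "DL"))

# Departed in July, August, or September
summer <- flights |> filter(month %in% c(7, 8, 9))

# Arrived more than two hours late, but did not leave late
late_despite_on_time <- flights |> filter(arr_delay > 120, dep_delay <= 0)

# Departure delayed by at least an hour, but made up over 30 minutes in flight
made_up_time <- flights |> filter(dep_delay >= 60, dep_delay - arr_delay > 30)
```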


### Exercise 2

Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.

In [10]:
flights |>
  arrange(desc(dep_delay), dep_time) |>
  head(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,9,641,900,1301,1242,1530,1272,HA,51,N384HA,JFK,HNL,640,4983,9,0,2013-01-09 09:00:00
2013,6,15,1432,1935,1137,1607,2120,1127,MQ,3535,N504MQ,JFK,CMH,74,483,19,35,2013-06-15 19:00:00
2013,1,10,1121,1635,1126,1239,1810,1109,MQ,3695,N517MQ,EWR,ORD,111,719,16,35,2013-01-10 16:00:00
2013,9,20,1139,1845,1014,1457,2210,1007,AA,177,N338AA,JFK,SFO,354,2586,18,45,2013-09-20 18:00:00
2013,7,22,845,1600,1005,1044,1815,989,MQ,3075,N665MQ,JFK,CVG,96,589,16,0,2013-07-22 16:00:00
2013,4,10,1100,1900,960,1342,2211,931,DL,2391,N959DL,JFK,TPA,139,1005,19,0,2013-04-10 19:00:00
2013,3,17,2321,810,911,135,1020,915,DL,2119,N927DA,LGA,MSP,167,1020,8,10,2013-03-17 08:00:00
2013,6,27,959,1900,899,1236,2226,850,DL,2007,N3762Y,JFK,PDX,313,2454,19,0,2013-06-27 19:00:00
2013,7,22,2257,759,898,121,1026,895,DL,2047,N6716C,LGA,ATL,109,762,7,59,2013-07-22 07:00:00
2013,12,5,756,1700,896,1058,2020,878,AA,172,N5DMAA,EWR,MIA,149,1085,17,0,2013-12-05 17:00:00
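
The second half of the exercise, the flights that left earliest in the morning, needs its own sort. A minimal sketch, assuming "earliest" means the smallest actual `dep_time` (sorting on `sched_dep_time` would be another reasonable reading); `earliest_departures` is an illustrative name.

```r
library(nycflights13)
library(dplyr)

# dep_time is a clock time in HHMM format, so the smallest values are the
# departures closest after midnight; arrange() places missing values last
earliest_departures <- flights |>
  arrange(dep_time) |>
  head(10)

earliest_departures
```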


### Exercise 3

Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

In [12]:
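# Caveat: dep_time and arr_time are clock times in HHMM format, so this
# difference is not a duration in minutes; flights that depart late at
# night and land after midnight sort first
flights |>
  arrange(arr_time - (dep_time - 2400)) |>
  head(10)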


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,7,17,2400,2142,138,54,2259,115,EV,3832,N22971,EWR,DCA,37,199,21,42,2013-07-17 21:00:00
2013,12,9,2400,2250,70,59,2356,63,B6,1816,N187JB,JFK,SYR,41,209,22,50,2013-12-09 22:00:00
2013,6,12,2338,2129,129,17,2235,102,EV,4276,N11109,EWR,BDL,21,116,21,29,2013-06-12 21:00:00
2013,12,29,2332,2155,97,14,2300,74,EV,4682,N13955,EWR,ALB,26,143,21,55,2013-12-29 21:00:00
2013,11,6,2335,2215,80,18,2317,61,EV,4233,N23139,EWR,ALB,29,143,22,15,2013-11-06 22:00:00
2013,2,25,2347,2145,122,30,2239,111,EV,4378,N18557,EWR,BWI,33,169,21,45,2013-02-25 21:00:00
2013,8,13,2351,2152,119,35,2258,97,EV,4276,N15574,EWR,BDL,24,116,21,52,2013-08-13 21:00:00
2013,10,11,2342,2030,192,27,2205,142,WN,2520,N279WN,EWR,MDW,92,711,20,30,2013-10-11 20:00:00
2013,2,26,2356,2000,236,41,2104,217,EV,4162,N10575,EWR,ALB,24,143,20,0,2013-02-26 20:00:00
2013,1,24,2342,2159,103,28,2300,88,EV,4519,N14916,EWR,BWI,33,169,21,59,2013-01-24 21:00:00
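
An alternative reading of "fastest" is highest average speed, which can be computed from `distance` (miles) and `air_time` (minutes) inside `arrange()`. The sketch below takes that interpretation; it is an assumption about what the exercise intends, and `fastest_flights` is an illustrative name.

```r
library(nycflights13)
library(dplyr)

# Sort by average speed in miles per minute, fastest first;
# desc() places flights with a missing air_time last
fastest_flights <- flights |>
  arrange(desc(distance / air_time)) |>
  head(10)

fastest_flights
```

Multiplying `distance / air_time` by 60 would give miles per hour without changing the order.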


### Exercise 4

Was there a flight on every day of 2013?

In [16]:
flights |>
  distinct(year, month, day) |>
  nrow()
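
To turn the count into a yes/no answer, the number of distinct dates can be compared with the number of days in the year. A small sketch, with `days_with_flights` as an illustrative name:

```r
library(nycflights13)
library(dplyr)

days_with_flights <- flights |>
  distinct(year, month, day) |>
  nrow()

# 2013 was not a leap year, so a flight on every day means 365 distinct dates
days_with_flights == 365
```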


### Exercise 5

Which flights traveled the farthest distance? Which traveled the least distance?

In [17]:
flights |>
  arrange(desc(distance)) |>
  head(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,857,900,-3,1516,1530,-14,HA,51,N380HA,JFK,HNL,659,4983,9,0,2013-01-01 09:00:00
2013,1,2,909,900,9,1525,1530,-5,HA,51,N380HA,JFK,HNL,638,4983,9,0,2013-01-02 09:00:00
2013,1,3,914,900,14,1504,1530,-26,HA,51,N380HA,JFK,HNL,616,4983,9,0,2013-01-03 09:00:00
2013,1,4,900,900,0,1516,1530,-14,HA,51,N384HA,JFK,HNL,639,4983,9,0,2013-01-04 09:00:00
2013,1,5,858,900,-2,1519,1530,-11,HA,51,N381HA,JFK,HNL,635,4983,9,0,2013-01-05 09:00:00
2013,1,6,1019,900,79,1558,1530,28,HA,51,N385HA,JFK,HNL,611,4983,9,0,2013-01-06 09:00:00
2013,1,7,1042,900,102,1620,1530,50,HA,51,N385HA,JFK,HNL,612,4983,9,0,2013-01-07 09:00:00
2013,1,8,901,900,1,1504,1530,-26,HA,51,N389HA,JFK,HNL,645,4983,9,0,2013-01-08 09:00:00
2013,1,9,641,900,1301,1242,1530,1272,HA,51,N384HA,JFK,HNL,640,4983,9,0,2013-01-09 09:00:00
2013,1,10,859,900,-1,1449,1530,-41,HA,51,N388HA,JFK,HNL,633,4983,9,0,2013-01-10 09:00:00


In [19]:
flights |>
  arrange(distance) |>
  head(10)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,7,27,,106,,,245,,US,1632,,EWR,LGA,,17,1,6,2013-07-27 01:00:00
2013,1,3,2127.0,2129,-2.0,2222.0,2224,-2.0,EV,3833,N13989,EWR,PHL,30.0,80,21,29,2013-01-03 21:00:00
2013,1,4,1240.0,1200,40.0,1333.0,1306,27.0,EV,4193,N14972,EWR,PHL,30.0,80,12,0,2013-01-04 12:00:00
2013,1,4,1829.0,1615,134.0,1937.0,1721,136.0,EV,4502,N15983,EWR,PHL,28.0,80,16,15,2013-01-04 16:00:00
2013,1,4,2128.0,2129,-1.0,2218.0,2224,-6.0,EV,4645,N27962,EWR,PHL,32.0,80,21,29,2013-01-04 21:00:00
2013,1,5,1155.0,1200,-5.0,1241.0,1306,-25.0,EV,4193,N14902,EWR,PHL,29.0,80,12,0,2013-01-05 12:00:00
2013,1,6,2125.0,2129,-4.0,2224.0,2224,0.0,EV,4619,N22909,EWR,PHL,22.0,80,21,29,2013-01-06 21:00:00
2013,1,7,2124.0,2129,-5.0,2212.0,2224,-12.0,EV,4619,N33182,EWR,PHL,25.0,80,21,29,2013-01-07 21:00:00
2013,1,8,2127.0,2130,-3.0,2304.0,2225,39.0,EV,4619,N11194,EWR,PHL,30.0,80,21,30,2013-01-08 21:00:00
2013,1,9,2126.0,2129,-3.0,2217.0,2224,-7.0,EV,4619,N17560,EWR,PHL,27.0,80,21,29,2013-01-09 21:00:00
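
The first row above (EWR to LGA, 17 miles) has no departure or arrival times, which suggests the flight never actually flew. A possible refinement, assuming a missing `dep_time` marks a flight that did not depart, is to drop those rows before sorting; `shortest_flown` is an illustrative name.

```r
library(nycflights13)
library(dplyr)

# Keep only flights with a recorded departure, then sort by distance
shortest_flown <- flights |>
  filter(!is.na(dep_time)) |>
  arrange(distance) |>
  head(10)

shortest_flown
```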


### Exercise 6

Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

**Answer:** Yes, for performance. Running `filter()` before `arrange()` is more efficient because `arrange()` then has fewer rows to sort. Both orders return the same rows in the same order, but filtering first is generally faster.
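
A quick way to check this claim is to run both orders and compare them. The sketch below uses an arbitrary filter (`dest == "IAH"`) and sort key (`dep_delay`) chosen only for illustration; the timings from `system.time()` vary by machine and are meant as a rough indication, not a benchmark.

```r
library(nycflights13)
library(dplyr)

filter_first <- flights |>
  filter(dest == "IAH") |>
  arrange(dep_delay)

arrange_first <- flights |>
  arrange(dep_delay) |>
  filter(dest == "IAH")

# Both orders keep the same number of rows, sorted by the same key
nrow(filter_first) == nrow(arrange_first)
identical(filter_first$dep_delay, arrange_first$dep_delay)

# Rough timing comparison of the two orders
system.time(flights |> filter(dest == "IAH") |> arrange(dep_delay))
system.time(flights |> arrange(dep_delay) |> filter(dest == "IAH"))
```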