In [1]:
library(tidyverse)
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.0     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


### Review

In [28]:
load("mi_census.RData"); mi_census

county,stat,year,val
wayne,income,2000,40776
washtenaw,income,2000,51990
wayne,pop,2000,2059000
washtenaw,pop,2000,324491
wayne,income,2010,38380
washtenaw,income,2010,65500
wayne,pop,2010,1815000
washtenaw,pop,2010,345515


Reshape the table to obtain:
<table>
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr>
    <td>
       <table>
<thead><tr><th scope="col">county</th><th scope="col">stat</th><th scope="col">2000</th><th scope="col">2010</th></tr></thead>
<tbody>
	<tr><td>washtenaw</td><td>income   </td><td>  51990  </td><td>  65500  </td></tr>
	<tr><td>washtenaw</td><td>pop      </td><td> 324491  </td><td> 345515  </td></tr>
	<tr><td>wayne    </td><td>income   </td><td>  40776  </td><td>  38380  </td></tr>
	<tr><td>wayne    </td><td>pop      </td><td>2059000  </td><td>1815000  </td></tr>
</tbody>
</table>
    </td>
    <td>
    <table>
<thead><tr><th scope="col">county</th><th scope="col">income_2000</th><th scope="col">population_2000</th><th scope="col">income_2010</th><th scope="col">population_2010</th></tr></thead>
<tbody>
	<tr><td>wayne    </td><td>40776    </td><td>2059000  </td><td>38380    </td><td>1815000  </td></tr>
	<tr><td>washtenaw</td><td>51990    </td><td> 324491  </td><td>65500    </td><td> 345515  </td></tr>
</tbody>
</table>
    </td>
    </tr>
    </table>

# Lecture 09: Relational Data
<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Understand how the tables in `nycflights13` relate to one another
* Identify primary and foreign [keys](#Keys)
* Combine tables using inner and outer [joins](#Joins)
* (Optional) Write basic [SQL queries](#SQL-queries)
</div>


We have already spent a lot of time analyzing the `flights` table. In fact, there are four other tables in `nycflights13` that contain related information about these flights:

In [120]:
print(airlines)

# A tibble: 16 x 2
   carrier                        name
     <chr>                       <chr>
 1      9E           Endeavor Air Inc.
 2      AA      American Airlines Inc.
 3      AS        Alaska Airlines Inc.
 4      B6             JetBlue Airways
 5      DL        Delta Air Lines Inc.
 6      EV    ExpressJet Airlines Inc.
 7      F9      Frontier Airlines Inc.
 8      FL AirTran Airways Corporation
 9      HA      Hawaiian Airlines Inc.
10      MQ                   Envoy Air
11      OO       SkyWest Airlines Inc.
12      UA       United Air Lines Inc.
13      US             US Airways Inc.
14      VX              Virgin America
15      WN      Southwest Airlines Co.
16      YV          Mesa Airlines Inc.


In [121]:
print(airports)

# A tibble: 1,458 x 8
     faa                           name      lat        lon   alt    tz   dst
   <chr>                          <chr>    <dbl>      <dbl> <int> <dbl> <chr>
 1   04G              Lansdowne Airport 41.13047  -80.61958  1044    -5     A
 2   06A  Moton Field Municipal Airport 32.46057  -85.68003   264    -6     A
 3   06C            Schaumburg Regional 41.98934  -88.10124   801    -6     A
 4   06N                Randall Airport 41.43191  -74.39156   523    -5     A
 5   09J          Jekyll Island Airport 31.07447  -81.42778    11    -5     A
 6   0A9 Elizabethton Municipal Airport 36.37122  -82.17342  1593    -5     A
 7   0G6        Williams County Airport 41.46731  -84.50678   730    -5     A
 8   0G7  Finger Lakes Regional Airport 42.88356  -76.78123   492    -5     A
 9   0P2   Shoestring Aviation Airfield 39.79482  -76.64719  1000    -5     U
10   0S9          Jefferson County Intl 48.05381 -122.81064   108    -8     A
# ... with 1,448 more rows, and 1 more var

In [122]:
print(planes)

# A tibble: 3,322 x 9
   tailnum  year                    type     manufacturer     model engines
     <chr> <int>                   <chr>            <chr>     <chr>   <int>
 1  N10156  2004 Fixed wing multi engine          EMBRAER EMB-145XR       2
 2  N102UW  1998 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
 3  N103US  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
 4  N104UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
 5  N10575  2002 Fixed wing multi engine          EMBRAER EMB-145LR       2
 6  N105UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
 7  N107US  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
 8  N108UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
 9  N109UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
10  N110UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
# ... with 3,312 more rows, and 3 more variables: seats <int>, spe

In [123]:
print(weather)

# A tibble: 26,130 x 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
    <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1    EWR  2013     1     1     0 37.04 21.92 53.97      230   10.35702
 2    EWR  2013     1     1     1 37.04 21.92 53.97      230   13.80936
 3    EWR  2013     1     1     2 37.94 21.92 52.09      230   12.65858
 4    EWR  2013     1     1     3 37.94 23.00 54.51      230   13.80936
 5    EWR  2013     1     1     4 37.94 24.08 57.04      240   14.96014
 6    EWR  2013     1     1     6 39.02 26.06 59.37      270   10.35702
 7    EWR  2013     1     1     7 39.02 26.96 61.63      250    8.05546
 8    EWR  2013     1     1     8 39.02 28.04 64.43      240   11.50780
 9    EWR  2013     1     1     9 39.92 28.04 62.21      250   12.65858
10    EWR  2013     1     1    10 39.02 28.04 64.43      260   12.65858
# ... with 26,120 more rows, and 5 more variables: wind_gust <dbl>,
#   precip <dbl>, pressure <dbl>, visib <dbl

Together with `flights`, these four tables form a *relational database*. The relationships can be graphed like so:
![table relationships](http://r4ds.had.co.nz/diagrams/relational-nycflights.png)

The particular relationships in this database are:
- `flights` connects to `planes` via `tailnum`.
- `flights` connects to `airlines` via `carrier`.
- `flights` connects to `airports` twice: via `origin` and `dest`.
- `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour`.

## Keys
The "key" to understanding relational databases is... keys. 



### Primary Key
A *primary key* is a variable (or set of variables) that uniquely identifies an observation in its own table: there is **at most** one row in the table that corresponds to any setting of the columns which comprise the key.

In the `planes` table, each airplane is identified by its `tailnum`:

In [39]:
print(planes)

# A tibble: 3,322 x 9
   tailnum  year type          manufacturer   model  engines seats speed engine 
   <chr>   <int> <chr>         <chr>          <chr>    <int> <int> <int> <chr>  
 1 N10156   2004 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
 2 N102UW   1998 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 3 N103US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 4 N104UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 5 N10575   2002 Fixed wing m… EMBRAER        EMB-1…       2    55    NA Turbo-…
 6 N105UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 7 N107US   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 8 N108UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
 9 N109UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
10 N110UW   1999 Fixed wing m… AIRBUS INDUST… A320-…       2   182    NA Turbo-…
# ... 

The tail number of an airplane is assigned by a government agency and is unique: no two planes can have the same tail number. Thus, `tailnum` should be a primary key in this table. 

To check that one or more variables constitute a primary key, we can count the distinct combinations of their values and check that this count equals the number of rows in the data set:

In [40]:
planes %>% nrow
planes %>% distinct(tailnum) %>% nrow
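
An equivalent check counts the rows for each key value and looks for any count above one. Below is a minimal sketch on a made-up table (`toy_planes` and its values are hypothetical, not part of `nycflights13`):

```r
library(dplyr)

# Hypothetical toy table, used only for illustration
toy_planes <- tribble(
  ~tailnum, ~seats,
  "N1",         55,
  "N2",        182,
  "N3",        182
)

# Key values that occur more than once; zero rows means the column is a primary key
dup_tailnum <- toy_planes %>% count(tailnum) %>% filter(n > 1)
dup_seats   <- toy_planes %>% count(seats) %>% filter(n > 1)

nrow(dup_tailnum)  # no duplicated tailnum values
nrow(dup_seats)    # one duplicated seats value (182)
```

This form has the advantage that, when the check fails, the offending key values are already in hand for inspection.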

Compare with `flights`, where `tailnum` does *not* uniquely identify each row. (There are many flights present for the same airplane.)

In [42]:
flights %>% distinct(tailnum) %>% nrow

What is the primary key for the `flights` table?

In [127]:
print(flights)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with 336,766 more row

We might guess that `year`, `month`, `day`, and `tailnum` are sufficient to identify each row in `flights`, but this is not true:

In [128]:
flights %>% summarize(n=n(), nd=n_distinct(year, month, day, tailnum))

  n      nd    
1 336776 251727

In fact, even restricting to the exact *minute* that an airplane departed is not sufficient:

In [129]:
flights %>% summarize(n=n(), nd=n_distinct(tailnum, time_hour, minute))

  n      nd    
1 336776 336367

This says that there are certain airplanes that are marked as having departed more than once in the same year, month, day, hour and minute. We can inspect these rows as follows:

In [6]:
group_by(flights, tailnum, time_hour, minute) %>% count %>% filter(n>1) %>% arrange(tailnum, time_hour) %>% print

# A tibble: 298 x 4
# Groups:   tailnum, time_hour, minute [298]
   tailnum time_hour           minute     n
   <chr>   <dttm>               <dbl> <int>
 1 N11119  2013-06-10 16:00:00     55     2
 2 N11192  2013-08-26 08:00:00     30     2
 3 N12563  2013-02-04 16:00:00     19     2
 4 N12564  2013-01-13 20:00:00      0     2
 5 N12900  2013-07-10 21:00:00     29     2
 6 N13969  2013-01-28 07:00:00     59     2
 7 N14148  2013-03-12 06:00:00     30     2
 8 N14558  2013-04-19 13:00:00     29     2
 9 N14916  2013-02-11 13:00:00     15     2
10 N14974  2013-07-26 06:00:00     30     2
# ... with 288 more rows


These likely indicate rounding or data entry errors.

### Exercise
What constitutes a primary key in the `mpg` table?

In [47]:
# Your code here

### Foreign Key
A *foreign key* uniquely identifies an observation in another table.

`flights$tailnum` is a foreign key: it appears in the `flights` table, where it matches each flight to a unique row of the `planes` table.

In [132]:
flights %>% slice(1) %>% select(tailnum, everything()) %>% print
planes %>% filter(tailnum == flights$tailnum[1])

# A tibble: 1 x 19
  tailnum  year month   day dep_time sched_dep_time dep_delay arr_time
    <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  N14228  2013     1     1      517            515         2      830
# ... with 11 more variables: sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>


  tailnum year type                    manufacturer model   engines seats speed
1 N14228  1999 Fixed wing multi engine BOEING       737-824 2       149   NA   
  engine   
1 Turbo-fan

Similarly, `weather$origin` is a foreign key: it matches each weather observation to a unique row of `airports` via `airports$faa`.

In [133]:
weather %>% slice(1) %>% print
airports %>% filter(faa==weather$origin[1])

# A tibble: 1 x 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
1    EWR  2013     1     1     0 37.04 21.92 53.97      230   10.35702
# ... with 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>


  faa name                lat     lon       alt tz dst tzone           
1 EWR Newark Liberty Intl 40.6925 -74.16867 18  -5 A   America/New_York

A variable can be both a primary key and a foreign key. For example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airports` table.
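
A foreign key is only useful if its values actually occur in the table it points to. One way to check this, sketched here on made-up tables (`toy_planes` and `toy_flights` are hypothetical), is to look for foreign key values that have no matching primary key:

```r
library(dplyr)

# Hypothetical toy tables: each flight should point at a row of toy_planes
toy_planes  <- tribble(~tailnum, ~seats, "N1", 55, "N2", 182)
toy_flights <- tribble(~flight, ~tailnum, 1, "N1", 2, "N2", 3, "N1", 4, "N9")

# Rows whose foreign key value has no counterpart in toy_planes
unmatched <- toy_flights %>% filter(!tailnum %in% toy_planes$tailnum)
unmatched
```

Rows like these produce missing values after a join, which is what the `<NA>` entries in the join output later in this lecture reflect.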

### Relations
A primary key and the corresponding foreign key in another table form a *relation*. Relations come in several forms:
- *One-to-many* (the most common). For example, each flight has one plane, but each plane has many flights.
- *Many-to-many*. For example, each airline flies to many airports; each airport hosts many airlines.
- *One-to-one*. Each row in one table corresponds uniquely to a row in a second table. This is relatively uncommon because you could just as easily combine the two tables into one.

In [135]:
x = tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y = tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

## Joins
Joins are the way that we combine or "merge" two data tables based on keys.
To understand how joins work we'll study these two simple tables:
![simple tables](http://r4ds.had.co.nz/diagrams/join-setup.png)

In [136]:
x
y

  key val_x
1 1   x1   
2 2   x2   
3 3   x3   

  key val_y
1 1   y1   
2 2   y2   
3 4   y3   

The first column of each table is called `key` and serves as the primary key: each row has a different value of `key`. 

Let's imagine all the possible ways we could join these two tables. Each intersecting line represents a potential match; there are 3 observations in each table, for a total of $3^2=9$ intersections.
![possible joins](http://r4ds.had.co.nz/diagrams/join-setup2.png)
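
The nine potential matches can also be listed explicitly. The sketch below is not how `dplyr` implements joins; it uses base R's `merge()` with `by = NULL`, which returns every pairing of a row of `x` with a row of `y`, and then keeps the pairs whose keys agree:

```r
library(dplyr)

x <- tribble(~key, ~val_x, 1, "x1", 2, "x2", 3, "x3")
y <- tribble(~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")

# by = NULL asks merge() for the Cartesian product of the rows
pairs <- merge(as.data.frame(x), as.data.frame(y), by = NULL)
nrow(pairs)

# Both tables have a column called `key`, so merge() renames them key.x and key.y
matches <- pairs %>% filter(key.x == key.y)
matches
```

Only the pairs with keys 1 and 2 survive the filter; keys 3 and 4 have no partner in the other table.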

Matches will be indicated with dots:
![match example](http://r4ds.had.co.nz/diagrams/join-inner.png)

### Inner joins
Inner joins match a pair of observations whenever their keys are equal:
![match example](http://r4ds.had.co.nz/diagrams/join-inner.png)

In [137]:
x %>% inner_join(y, by = "key")

  key val_x val_y
1 1   x1    y1   
2 2   x2    y2   

Note that there is no row for `key=3` or `key=4`: with an inner join, unmatched rows are not included in the result. For this reason, inner joins are used less often in data analysis: it is easy to lose observations without noticing.

### Outer joins
An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
- A left join keeps all observations in x.
- A right join keeps all observations in y.
- A full join keeps all observations in x and y.

![outer join](http://r4ds.had.co.nz/diagrams/join-outer.png)

Left joins are the most common. Use them to look up data in another table while preserving your original observations in cases where the other table does not have a match.
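
As an illustration, here are the three outer joins applied to the small `x` and `y` tables from above (redefined so that the snippet runs on its own):

```r
library(dplyr)

x <- tribble(~key, ~val_x, 1, "x1", 2, "x2", 3, "x3")
y <- tribble(~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")

left  <- left_join(x, y, by = "key")   # keys 1, 2, 3; val_y is NA for key 3
right <- right_join(x, y, by = "key")  # keys 1, 2, 4; val_x is NA for key 4
full  <- full_join(x, y, by = "key")   # keys 1, 2, 3, 4

left
right
full
```

In each case the unmatched rows are kept and the columns coming from the other table are filled with `NA`.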

### Example
The `flights` table has a `carrier` column which is a two-letter code for the airline. The `airlines` table maps these codes to recognizable airline names. You can combine the `airlines` and `flights` data frames with `left_join()`:

In [1]:
left_join(flights, airlines) %>% 
    select(year, month, day, tailnum, carrier, name) %>% count(name)



### Exercise
Use `left_join()` to determine:
<table>
    <tr><th>Beginner</th><th>Advanced</th></tr>
    <tr>
        <td>The number of flights operated by Envoy Air.</td>
        <td>The number of flights that departed in sub-zero conditions (<code>temp < 0</code>) in 2013.</td>
    </tr>
    </table>

In [None]:
# Your code here

### Defining the key columns
When we do a join using `left_join()` without specifying a key, R takes as the key whatever column names the two tables have in common:


In [2]:
left_join(flights, airlines);

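
To see in advance which columns such a join will use, we can intersect the column names ourselves. A small sketch on made-up tables (`toy_flights` and `toy_airlines` are hypothetical):

```r
library(dplyr)

# Hypothetical toy tables that share only the `carrier` column
toy_flights  <- tribble(~flight, ~carrier, 1, "AA", 2, "MQ", 3, "AA")
toy_airlines <- tribble(~carrier, ~name,
                        "AA", "American Airlines Inc.",
                        "MQ", "Envoy Air")

# The columns that a join without `by` would use as the key
common <- intersect(names(toy_flights), names(toy_airlines))
common

# Omitting `by` joins on those columns; dplyr prints a message naming them
joined <- left_join(toy_flights, toy_airlines)
joined
```

This is worth doing whenever two tables might share a column name by accident. For example, `flights` and `planes` both have a `year` column, but it means the year of the flight in one table and the year of manufacture in the other.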


We can also specify the join column explicitly using `by`, as we did in the inner join example above:

In [140]:
left_join(flights, airlines, by="carrier") %>% print

# A tibble: 336,776 x 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with 336,766 more row

Note that we have to quote the join columns. Otherwise we will get an error:
```{r}
> left_join(flights, airlines, by=carrier)
Error in common_by(by, x, y): object 'carrier' not found
Traceback:

1. left_join(flights, airlines, by = carrier) %>% print
2. eval(lhs, parent, parent)
3. eval(lhs, parent, parent)
4. left_join(flights, airlines, by = carrier)
5. left_join.tbl_df(flights, airlines, by = carrier)
6. common_by(by, x, y)
```

Finally, we can be explicit about the column names. This is necessary in situations where the key has a different name in each table. For example, to join the `airports` table to `flights`, we need to match the `origin`/`dest` column in `flights` to the `faa` column in `airports`:

In [141]:
flights %>% left_join(airports, c("dest" = "faa")) %>% select("origin", "dest", 20:23) %>% print

# A tibble: 336,776 x 6
   origin  dest                            name      lat       lon   alt
    <chr> <chr>                           <chr>    <dbl>     <dbl> <int>
 1    EWR   IAH    George Bush Intercontinental 29.98443 -95.34144    97
 2    LGA   IAH    George Bush Intercontinental 29.98443 -95.34144    97
 3    JFK   MIA                      Miami Intl 25.79325 -80.29056     8
 4    JFK   BQN                            <NA>       NA        NA    NA
 5    LGA   ATL Hartsfield Jackson Atlanta Intl 33.63672 -84.42807  1026
 6    EWR   ORD              Chicago Ohare Intl 41.97860 -87.90484   668
 7    EWR   FLL  Fort Lauderdale Hollywood Intl 26.07258 -80.15275     9
 8    LGA   IAD          Washington Dulles Intl 38.94453 -77.45581   313
 9    JFK   MCO                    Orlando Intl 28.42939 -81.30899    96
10    LGA   ORD              Chicago Ohare Intl 41.97860 -87.90484   668
# ... with 336,766 more rows
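
The `by` argument also accepts several columns at once, which is what a relation like the one between `flights` and `weather` requires. A minimal sketch with made-up data (the tables and their values are hypothetical; only the column names mirror the real ones):

```r
library(dplyr)

# Hypothetical toy data
toy_flights <- tribble(
  ~origin, ~hour, ~flight,
  "EWR",       5,    1545,
  "JFK",       5,    1141,
  "EWR",       6,     461
)
toy_weather <- tribble(
  ~origin, ~hour, ~temp,
  "EWR",       5,  39.0,
  "EWR",       6,  37.9,
  "JFK",       5,  39.0
)

# A row of toy_weather is identified only by origin and hour together,
# so both columns go into `by`
toy_joined <- left_join(toy_flights, toy_weather, by = c("origin", "hour"))
toy_joined
```

Joining on `origin` alone would instead pair every flight with every weather reading from its airport.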


(The rest of the material is optional)

## SQL queries

SQL stands for "Structured Query Language". Many large data sets are stored in relational databases that are queried using SQL, and you will probably encounter one if you work on big data and/or at a large company. 

To introduce SQL we're going to use the `sqldf` package, which lets us run SQL queries on R tibbles/data frames. Also, to make things go faster, we'll operate on a random subsample containing roughly 1% of the rows of `flights`.

In [39]:
# install.packages("sqldf") if necessary
library(sqldf)
flights_sub <- filter(flights, runif(n=nrow(flights)) < .01)
# This function takes the data.frame returned by sqldf and converts it to a tibble
sqltbl <- function(...) sqldf(...) %>% as_tibble

## Selecting data
The most basic SQL query you will encounter selects data from a table. 

In [23]:
sqltbl("SELECT * from flights_sub") %>% print
flights_sub %>% print

# A tibble: 3,240 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1     1713           1700        13     2006           2014
 2  2013     1     1     1842           1422       260     1958           1535
 3  2013     1     1     1952           1930        22     2257           2251
 4  2013     1     1     2008           1855        73     2223           2100
 5  2013     1     2      634            630         4      806            810
 6  2013     1     2      637            631         6      821            815
 7  2013     1     2      732            732         0     1047           1040
 8  2013     1     2      817            820        -3     1015           1007
 9  2013     1     2      820            827        -7     1117           1105
10  2013     1     2      904            910        -6     1026           1027
# ... with 3,230 more rows, a

#### Selecting data from a table
The SQL syntax for selecting column(s) from a table is
```{sql}
SELECT <col1>, <col2>, ..., <coln> FROM <table>
```
Note the similarity to the corresponding `tidyverse` command:
```{r}
select(<table>, <col1>, <col2>, ..., <coln>)
```

In [25]:
sqltbl("SELECT tailnum FROM flights_sub") %>% print
flights_sub %>% select(tailnum) %>% print

# A tibble: 3,240 x 1
   tailnum
   <chr>  
 1 N346JB 
 2 N18120 
 3 N76523 
 4 N527MQ 
 5 N3DYAA 
 6 N39297 
 7 N73291 
 8 N931XJ 
 9 N279JB 
10 N178JB 
# ... with 3,230 more rows
# A tibble: 3,240 x 1
   tailnum
   <chr>  
 1 N346JB 
 2 N18120 
 3 N76523 
 4 N527MQ 
 5 N3DYAA 
 6 N39297 
 7 N73291 
 8 N931XJ 
 9 N279JB 
10 N178JB 
# ... with 3,230 more rows


The special character `*` in SQL will match all columns:

In [31]:
sqltbl("SELECT * FROM flights_sub") %>% print

# A tibble: 3,240 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1     1713           1700        13     2006           2014
 2  2013     1     1     1842           1422       260     1958           1535
 3  2013     1     1     1952           1930        22     2257           2251
 4  2013     1     1     2008           1855        73     2223           2100
 5  2013     1     2      634            630         4      806            810
 6  2013     1     2      637            631         6      821            815
 7  2013     1     2      732            732         0     1047           1040
 8  2013     1     2      817            820        -3     1015           1007
 9  2013     1     2      820            827        -7     1117           1105
10  2013     1     2      904            910        -6     1026           1027
# ... with 3,230 more rows, a

#### Filtering

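The SQL syntax for filtering rows in a table uses the `WHERE` clause:
```{sql}
SELECT * FROM <table> WHERE dest = 'IAH'
```
This is the same as:
```{r}
filter(<table>, dest=="IAH")
```
Note that SQL uses a single `=` to check equality! Standard SQL also writes string literals in single quotes; double quotes are meant for identifiers such as column names, although SQLite (the default backend of `sqldf`) tolerates them for strings.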

In [33]:
sqltbl("SELECT * FROM flights_sub WHERE dest = 'IAH'") %>% print
filter(flights_sub, dest=="IAH") %>% print

# A tibble: 84 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     2      732            732         0     1047           1040
 2  2013     1    28     1035           1015        20     1332           1334
 3  2013     1    30      746            752        -6     1111           1058
 4  2013    10     2      855            855         0     1121           1156
 5  2013    10     5     1023           1030        -7     1312           1326
 6  2013    10     5     1634           1629         5     1915           1919
 7  2013    10     7     1429           1415        14     1718           1704
 8  2013    10    20     1653           1700        -7     1927           1953
 9  2013    10    21      903            902         1     1211           1148
10  2013    11     3     1020           1029        -9     1305           1325
# ... with 74 more rows, and 11 

#### Summarizing

The SQL syntax for summarizing uses the `GROUP BY` clause:
```{sql}
SELECT <group_col>, summary_func(<col>) AS new_col FROM <table> GROUP BY <group_col>
```
This is the same as:
```{r}
group_by(<table>, <group_col>) %>% summarize(new_col=summary_func(<col>))
```

In [35]:
sqltbl("SELECT origin, COUNT(*) AS n FROM flights_sub GROUP BY origin")
flights_sub %>% group_by(origin) %>% summarize(n=n())

  origin n   
1 EWR    1142
2 JFK    1062
3 LGA    1036

  origin n   
1 EWR    1142
2 JFK    1062
3 LGA    1036

#### Joins

The SQL syntax for joins:
```{sql}
SELECT * FROM <table> [INNER | LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]] JOIN <other_table> ON <condition>
```
For a left join, this is the same as:
```{r}
left_join(<table>, <other_table>, by=<key>)
```
SQL is more general in specifying the join condition. Whereas the `dplyr` joins used above match rows on equality of key columns, in SQL the condition can be a general logical expression.
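
As a sketch of such a general condition, the query below matches each delay to the band whose range contains it, rather than to a row with an equal key. The tables `delays` and `bands` and their values are made up for illustration:

```r
library(sqldf)

# Hypothetical toy tables
delays <- data.frame(flight = c(1, 2, 3), dep_delay = c(5, 40, 130))
bands  <- data.frame(label = c("short", "medium", "long"),
                     lo = c(0, 30, 120),
                     hi = c(30, 120, 1000))

# The ON clause is a range condition, not an equality between key columns
banded <- sqldf("SELECT delays.flight, delays.dep_delay, bands.label
                 FROM delays LEFT JOIN bands
                 ON delays.dep_delay >= bands.lo AND delays.dep_delay < bands.hi")
banded
```

Each delay falls into exactly one band here, so the result has one row per flight.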

In [38]:
sqltbl('SELECT flights_sub.year, flights_sub.month, flights_sub.day, 
               planes.tailnum, planes.manufacturer FROM flights_sub 
        LEFT OUTER JOIN planes ON flights_sub.tailnum=planes.tailnum') %>% print

# A tibble: 3,240 x 5
    year month   day tailnum manufacturer  
   <int> <int> <int> <chr>   <chr>         
 1  2013     1     1 N346JB  EMBRAER       
 2  2013     1     1 N18120  EMBRAER       
 3  2013     1     1 N76523  BOEING        
 4  2013     1     1 <NA>    <NA>          
 5  2013     1     2 <NA>    <NA>          
 6  2013     1     2 N39297  BOEING        
 7  2013     1     2 N73291  BOEING        
 8  2013     1     2 N931XJ  BOMBARDIER INC
 9  2013     1     2 N279JB  EMBRAER       
10  2013     1     2 N178JB  EMBRAER       
# ... with 3,230 more rows


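Note that when joining multiple tables in SQL, it pays to be explicit about which columns we are `SELECT`ing. Any column name that occurs in more than one of the joined tables (here `tailnum` and `year`) must be prefixed with the name of the table in which it resides; prefixing every column, as above, avoids any ambiguity.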
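
Writing out full table names quickly becomes verbose. SQL lets us give each table a short alias with `AS` and use that as the prefix instead. A minimal sketch on made-up tables (`toy_flights`, `toy_planes` and the aliases `fl` and `pl` are hypothetical names):

```r
library(sqldf)

# Hypothetical toy tables that share the column name `tailnum`
toy_flights <- data.frame(flight = c(1, 2), tailnum = c("N1", "N2"))
toy_planes  <- data.frame(tailnum = c("N1", "N2"), seats = c(55, 182))

# fl and pl are aliases for the two tables
aliased <- sqldf("SELECT fl.flight, fl.tailnum, pl.seats
                  FROM toy_flights AS fl
                  LEFT JOIN toy_planes AS pl ON fl.tailnum = pl.tailnum")
aliased
```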