## SQL Exercises

### Learning goals:
* Be able to run commands in the Postgres REPL
* Be able to run simple SELECT commands to query what is in the database, and filter records using WHERE
* Be able to run aggregations (GROUP BY), and filter groups using HAVING
* Be able to JOIN different tables, and know which type of join (LEFT, RIGHT, INNER, OUTER) is appropriate
* Be able to create a VIEW
* Explain and understand why VIEWs exist, and why adding columns to the dataframe is a bad idea

### Out-of-scope:
This lesson does not cover:
* Getting the data from SQL into other formats (e.g. Python, Pandas, CSV). This is the focus of tomorrow's lecture.

### Connecting to the database

In the terminal, run the command `psql`. You should see a prompt `(database)=#`, where `database` is the database you are connected to (the default one you start with is named after your username on your machine). We will connect to the `names` database with
```sql
username=# \connect names;
names=#     -- <--this is your new prompt if you connected successfully
```

In SQL, statements are ended with a semicolon.
* Commands that begin with `\` (e.g. `\connect`) are not SQL commands. They are commands specific to Postgres, and for them the semicolon is optional (but it doesn't hurt).
* SQL commands (those that don't begin with `\`) need a semicolon.

To be safe, you can put a semicolon at the end of all statements.

In this worksheet, empty cells are available for you to copy the SQL strings. _We are not going to try and run anything in this workbook_.

Here are a few "slash" commands that are useful:

| Command | What it does |
| --- | --- |
| `\l` | Lists all databases available |
| `\connect XXXXX` | Connects to database XXXXX |
| `\dt` | Lists all tables on the current database |
| `\dv` | Lists all views on the current database |
| `\d` | Lists all tables and views on the current database |
| `\d TABLE_NAME` | Shows the schema (i.e. **columns** and **data type** of each column) of TABLE_NAME |

Run `\dt;` now at the terminal. You should see the following output:
```
 Schema |   Name    | Type  | Owner  
--------+-----------+-------+--------
 public | candidate | table | damien
 public | election  | table | damien
 public | name_freq | table | damien
 public | region    | table | damien
 ```
 
(The owner will be your name, not `damien`)

## SQL Exercises

We are going to work in the `names` database, focusing on two tables: `name_freq` and `region`. 

The `name_freq` table tells us how many children with each name, of each gender, were born in each state per year. If fewer than 5 people with a given name were born in a state in a given year, the Social Security Administration doesn't report them, for privacy reasons.
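
Before querying, it can help to confirm which columns each table has. A minimal sketch using the `\d` command from the list of slash commands (output omitted, since it depends on how your database was set up):

```sql
-- Show the columns and data types of name_freq
\d name_freq
-- Do the same for region
\d region
```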

## Exercise 1 (SELECT, together)
Let's see the first few rows with our first SQL command:

```sql
=# SELECT * FROM name_freq LIMIT 10;
```
If you run this in the terminal, you should see 
```
 state | gender | year |   name    | freq 
-------+--------+------+-----------+------
 AK    | F      | 1951 | Linda     |   79
 AK    | F      | 1951 | Mary      |   77
 AK    | F      | 1951 | Patricia  |   45
 AK    | F      | 1951 | Susan     |   45
 AK    | F      | 1951 | Barbara   |   35
 AK    | F      | 1951 | Kathleen  |   29
 AK    | F      | 1951 | Sandra    |   29
 AK    | F      | 1951 | Margaret  |   28
 AK    | F      | 1951 | Carol     |   27
 AK    | F      | 1951 | Elizabeth |   26
 ```
 
i.e. there were 79 girls named Linda born in Alaska in 1951.
 
* `SELECT` gets information from a table (or view), while `LIMIT` tells us how many rows to get. This is important if we have millions of rows of data -- when exploring the data we only want to grab a few rows.
* The syntax for a simple select is 
  `SELECT (comma separated list of columns/fields I want) FROM (table or view) LIMIT (number of rows)`
  where the limit is optional. If we want all fields, the wildcard `*` is used instead.
* SQL commands are case insensitive. The command
  `select * from name_freq limit 10`
  would also work. By convention, SQL keywords (`SELECT`, `FROM`, `LIMIT`) are capitalized, while column, table, and view names are left lowercase.
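
As a small illustration of the points above, the following sketch selects three named columns instead of `*`, once with conventional capitalization and once in lowercase. The choice of columns and the limit of 3 are arbitrary.

```sql
-- Conventional style: keywords capitalized, names lowercase
SELECT state, year, name FROM name_freq LIMIT 3;

-- Same query, all lowercase
select state, year, name from name_freq limit 3;
```

Without an `ORDER BY`, Postgres does not promise which rows a `LIMIT` returns, so the two statements usually, but not necessarily, show the same rows.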

**Common Key:**  
* Used for joining tables

**LIMIT 10;**  
* Used to preview data in SQL databases

**Don't name your tables after keywords:**  
* (`SELECT`, `FROM`, `LIMIT`, etc.)

## Exercise 2 (you do, 2 mins)

Get just the `name` and `freq` fields from the `name_freq` table for the first 5 records. 

Copy the SQL string into the cell below.

In [None]:
SELECT name, freq
FROM name_freq
LIMIT 5;

## Next: Conditional Selection (WHERE)

## Exercise 3 (WHERE - limiting rows / COUNT aggregation, together)

Let's say we were interested in the number of people named `John`. We can use `WHERE` to grab just those records:
```sql
=# SELECT * FROM name_freq WHERE name='John';
```

This will give us a lot of rows. A rough estimate would be one record for each of the 50 states for each of 66 years, i.e. 3300 rows. If you scroll to the bottom of the list, you will see there are actually 4110 rows! Try the following query:

```sql
=# SELECT * FROM name_freq WHERE name='John' AND gender='F';
```
There are 744 rows of female "John"s! This is presumably an error in data entry.

We found the number of results by scrolling through the list of returned entries. There is an aggregate function, `COUNT`, that will do this for us:
```sql
-- This is a comment in SQL
-- Count the number of rows that use the name John
-- Note same as the first query, except that we are counting rows instead
SELECT COUNT(*) FROM name_freq WHERE name='John';
```
This should return 
```
 count 
-------
  4110
(1 row)
```

**WARNING:**
In SQL, if you are quoting strings you *must* use single quotes (e.g. `'John'`). Double quotes are reserved for quoting identifiers such as column names. You need them if, for example, you have a column named `COUNT` and have to distinguish the SQL function `COUNT` from the column `"COUNT"`. The basic rule: always use single quotes for strings.
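
To see the difference in practice, here is a sketch of three variations on the John query. The second one is expected to fail in Postgres, because `"John"` in double quotes is read as a column name rather than a string.

```sql
-- Works: 'John' is a string value
SELECT COUNT(*) FROM name_freq WHERE name = 'John';

-- Fails: "John" is treated as an identifier,
-- and name_freq has no column called John
SELECT COUNT(*) FROM name_freq WHERE name = "John";

-- Works: "name" in double quotes still refers to the name column
SELECT COUNT(*) FROM name_freq WHERE "name" = 'John';
```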

## Exercise 4 (you, 3 minutes)

Write an SQL query that finds the records for people named Kate born in Washington state after 2010. Your query should return 6 rows. 

Copy the query below

In [None]:
SELECT *
FROM name_freq
WHERE name='Kate'
AND state='WA'
AND year > 2010;

## Exercise 5 (aggregation, GROUP BY, together)

We can use `COUNT` to get the number of rows, but what if we wanted to count the number of people named John in our dataset? We can use other aggregation functions, like `SUM`, instead. For example:
```sql
=# SELECT sum(freq) FROM name_freq WHERE name='John';
```
will return `2,638,761` (i.e. 2.6 million). If we wanted to find the total number of people in the dataset, regardless of name:
```sql
=# SELECT sum(freq) FROM name_freq;
```
which returns `219,289,140` (i.e. 219 million people).
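
`SUM` and `COUNT` are not the only aggregate functions. As a sketch, the standard functions `MIN`, `MAX`, and `AVG` can be combined in one `SELECT`. The aliases (`smallest`, `largest`, `average`, `num_rows`) are names chosen here for illustration.

```sql
-- Summarize the per-row frequencies for the name John
SELECT
    min(freq) AS smallest,   -- fewest Johns in any single (state, gender, year) row
    max(freq) AS largest,    -- most Johns in any single row
    avg(freq) AS average,    -- mean number of Johns per row
    count(*)  AS num_rows    -- number of rows that went into the summary
FROM name_freq
WHERE name='John';
```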

We can also use `GROUP BY`, which is similar to `groupby` in Pandas. With a `GROUP BY`, we pass in fields, and for each combination of values of those fields we reduce the records in that group to a single row. Here are some examples:

```sql
--- Get the number of people named John by gender
SELECT gender, sum(freq) FROM name_freq WHERE name='John' GROUP BY gender;
--- returns
---  gender |   sum   
--- --------+---------
---  F      |    7330
---  M      | 2631431
```
i.e. for each gender, we have the sum of frequencies for the rows we selected (the rows where the name was John).

```sql
--- Get the number of people named John for each state in 2000
SELECT state, sum(freq) FROM name_freq WHERE name='John' AND year=2000 GROUP BY state;
--- returns
---  state | sum  
--- -------+------
---  AK    |   49
---  AL    |  425
---  AR    |  195
---  AZ    |  229
--- etc
```

What about variants of a name, such as Johnny?
```sql
SELECT state, sum(freq) FROM name_freq WHERE (name='John' OR name='Johnny') AND year=2000 GROUP BY state;
```
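The parentheses in the query above are doing real work, because SQL evaluates `AND` before `OR`. The sketch below shows what the query would mean without them, along with an equivalent way to write the intended query using the standard `IN` operator.

```sql
-- Without parentheses this is read as
--   name='John' OR (name='Johnny' AND year=2000)
-- i.e. every John from any year, plus the Johnnys from 2000
SELECT state, sum(freq) FROM name_freq
WHERE name='John' OR name='Johnny' AND year=2000
GROUP BY state;

-- Equivalent to the parenthesized query: IN lists the accepted names
SELECT state, sum(freq) FROM name_freq
WHERE name IN ('John', 'Johnny') AND year=2000
GROUP BY state;
```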
  
> A row must first fulfill condition 1 (`name='John' OR name='Johnny'`),  
> then also fulfill condition 2 (`year=2000`).  
    
     

We can even group by multiple things:
```sql
=# SELECT state, year, sum(freq) AS num_johns FROM name_freq WHERE name='John' 
-# GROUP BY state, year
-# ORDER BY year ASC;
--- returns 
---  state | year | num_johns 
--- -------+------+-----------
---  AK    | 1951 |       100
---  AL    | 1951 |      1391
---  AR    | 1951 |       706
---  AZ    | 1951 |       440
---  CA    | 1951 |      5576
---  ...   | .... |      ....
---  AK    | 1952 |       133
---  AL    | 1952 |      1417
---  AR    | 1952 |       726
---  AZ    | 1952 |       468
---  CA    | 1952 |      5901
---  ...   | .... |       ...
```

Note that we slipped in some sorting (ORDER BY) at the end of the command.
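
`ORDER BY` can also sort by an aggregated column, and `DESC` reverses the order. A sketch that combines this with `LIMIT` to find the five states with the most Johns born in 2000 (the alias `num_johns` follows the example above):

```sql
SELECT state, sum(freq) AS num_johns
FROM name_freq
WHERE name='John' AND year=2000
GROUP BY state
ORDER BY num_johns DESC   -- largest totals first
LIMIT 5;                  -- keep only the top five
```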

**WARNING:**
It is easy to write "illegal" statements with `GROUP BY`. For example:
```sql
SELECT count(*), year, state FROM name_freq GROUP BY state;
```
is illegal because each group (i.e. each `state`) should return one row. But each state has multiple `year` entries (one per record), so we don't know which year to put in the result. SQL cannot tell what you want to do! Every column that appears after `SELECT` needs to be EITHER aggregated or part of the `GROUP BY`.

Some options:
1. Drop `year`:
  ```sql
  SELECT count(*), state FROM name_freq GROUP BY state;
  ``` 
  This is legal because each group is defined by having a single state
2. Make year part of the GROUP BY
  ```sql
  SELECT state, year, COUNT(*) FROM name_freq GROUP BY state, year;
  ```
  This is legal because each group is defined by having a single state and a single year.
3. Aggregate on year
  ```sql
  SELECT state, COUNT(DISTINCT(year)), COUNT(*) FROM name_freq GROUP BY state;
  ```
  This is legal because we have told SQL how to combine the multiple entries for year into a single value
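
Option 3 works with any aggregate function, not just `COUNT(DISTINCT ...)`. For instance, a sketch that reports the earliest and latest year on record for each state (the aliases are illustrative):

```sql
SELECT
    state,
    min(year) AS first_year,  -- earliest year with any record for this state
    max(year) AS last_year,   -- latest year with any record for this state
    count(*)  AS num_rows     -- total records for this state
FROM name_freq
GROUP BY state;
```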

## Exercise 6 (you do, 4 minutes)

Write a query that tells you how many different female names there were per state, per year. 

Hint: you should `COUNT` records, as each record has a unique `(name, gender, state, year)` combination. You are not looking to `SUM`.

You should find that for 1951:
- California had 1265 distinct female names
- Illinois had 906 distinct female names
- New York had 1156 distinct female names
- Washington had 511 distinct female names

Copy the query below.

In [None]:
-- Think through how the rows need to be filtered and grouped to retrieve the data
SELECT state, year, count(*) 
FROM name_freq
WHERE gender='F'
GROUP BY year, state;

## Exercise 7 (SUBQUERIES and HAVING)

Let's find the names that have at least 1000 people with that name in the year 2000. Finding the number of people with a given name is easy:
```sql
--- Find number of people with a given name in the year 2000
SELECT name, sum(freq) AS num_people FROM name_freq WHERE year=2000 GROUP BY name;
```

We now want to take the results of this query, and get only the results where `num_people >= 1000`. i.e. we are doing a select, and from the result of that query, we are querying again. This is called using a _subquery_.

Here are a couple of ways of doing this:
```sql
WITH num_names_in_2000 AS (
    SELECT name, sum(freq) AS num_people FROM name_freq WHERE year=2000 GROUP BY name
  ) 
  SELECT name, num_people FROM num_names_in_2000 WHERE num_people >= 1000 ORDER BY num_people DESC;
```
Here `num_names_in_2000` is a named subquery (a common table expression, which acts like a temporary view), and we then select from it. It is possible to write this in place as well (but I personally consider it ugly):
```sql
SELECT name, num_people FROM (
    SELECT name, sum(freq) as num_people FROM name_freq WHERE year=2000 GROUP BY name
    ) AS num_names_in_2000 
    WHERE num_people >= 1000
    ORDER BY num_people DESC;
```

The reason for doing two queries is that the thing we want to query on isn't available until _after_ we have done the aggregation (e.g. we don't know the total number of people named "Vivian" until _after_ we have done the group by). There is a simpler alternative if we are just trying to filter by the results of a groupby, called HAVING (it selects groups, rather than selecting individual records). The HAVING version of this query is
```sql
SELECT name,sum(freq) AS num_people FROM name_freq WHERE year=2000 GROUP BY name HAVING sum(freq) >= 1000 ORDER BY num_people DESC;
```

Important to note:
* We use `WHERE` for year, because every record has a year
* We need to use `HAVING` for `num_people` because we only know the total per group, not per record.
* Unfortunately, `HAVING num_people ....` doesn't work. You need to type in the formula `sum(freq)` rather than the alias. This can be enough for me to prefer subselects if the formula is very long (in subselects, I can use the alias).

**Note:**
In the previous examples, we could also have included `name` in the `GROUP BY` and selected the group with `HAVING name='John'`, instead of filtering rows with `WHERE name='John'`. This is largely a matter of taste. 
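
As a sketch of the two styles described in this note, both of the following queries are expected to return the same single row for John. Filtering with `WHERE` discards the other names before grouping, whereas `HAVING` is written against the groups.

```sql
-- Filter rows first, then aggregate
SELECT name, sum(freq) AS num_people
FROM name_freq
WHERE name='John'
GROUP BY name;

-- Group every name, then keep only the John group
SELECT name, sum(freq) AS num_people
FROM name_freq
GROUP BY name
HAVING name='John';
```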

In [None]:
-- Making code easier to read
WITH num_names_in_2000 AS (
    SELECT name, sum(freq) AS num_people
    FROM name_freq
    WHERE year=2000
    GROUP BY name
)
SELECT name, num_people
FROM num_names_in_2000
WHERE num_people >= 1000
ORDER BY num_people DESC;

## Exercise 8 (you do, 5 minutes)

Write a query that finds the states that have at least 800 distinct names in 1951. You can use `COUNT(DISTINCT(name))` to count the number of distinct names.

Your query should return 26 rows.

In [None]:
SELECT count(*) FROM (
    SELECT state 
    FROM name_freq
    WHERE year=1951
    GROUP BY state HAVING COUNT(DISTINCT(name)) >= 800
) AS states_with_many_names;  -- older Postgres versions require an alias for a subquery in FROM

In [None]:
SELECT state 
FROM name_freq
WHERE year=1951
GROUP BY state HAVING COUNT(DISTINCT(name)) >= 800;

## Exercise 9 (you do, <1 minute)

What are the columns on the `region` table? Select the entire table and look at the rows. 

Copy the query below

In [None]:
SELECT *
FROM region;

## Exercise 10 (you do, 1 minute)

How many different regions are there in the region table?

Copy the query below

In [None]:
SELECT count(distinct(region)) 
FROM region;

## Joining and views

Joining does the same thing as the merge command in Pandas. If we look at the `region` table, it tells us which region of the US each state belongs to (e.g. `CA` is in the `Pacific`, `IL` is in the `Midwest`, `NY` is in the `Mid_Atlantic`, et cetera).

We might be interested in knowing questions like "What is the most popular girl's name in the South?" 
This is difficult to answer right now, because the information about the names is on one table (`name_freq`), and the list of which states are part of the South belongs to another table (`region`).


In **Pandas**, we would create a new dataframe that merged `name_freq` and `region` on the common `state` column. This would create a brand new dataframe. The disadvantages of this approach are:
* It uses a lot of memory (we copy both `name_freq` and `region` information)
* It allows us to make data inconsistent (e.g. I could change `CA` to the `midwest` for one particular year in the Pandas approach)

For SQL, we join the tables together to make a **VIEW**. A view acts just like a table when selecting, except we cannot add new data to it, update data on it, or delete data from it. The result sets we have been getting back so far behave like temporary, unnamed views. When we join the `name_freq` and `region` tables together to make a new view and then change the data in `name_freq` or `region`, the view automatically updates. The disadvantage of this approach is:
* It is slower to run (basically the view is recreated each time we query it).

Let's make a join between the `name_freq` and `region` tables, then show how to use it as a view:

```sql
SELECT name_freq.*, region.region FROM 
        name_freq LEFT JOIN region          -- keep all rows of name_freq, match where possible on region
        ON name_freq.state = region.state   -- match rows if name_freq.state is the same as region.state
        LIMIT 20;                           -- only keep 20 rows
```
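The choice of join type decides what happens to rows without a match. As a sketch: because a `LEFT JOIN` keeps every row of `name_freq`, any state missing from `region` shows up with a `NULL` region, which the first query below looks for. An `INNER JOIN` would silently drop those rows instead. Whether any such states exist in this dataset is not stated in this lesson, so no output is shown.

```sql
-- States that appear in name_freq but have no entry in region
SELECT DISTINCT name_freq.state
FROM name_freq LEFT JOIN region
     ON name_freq.state = region.state
WHERE region.state IS NULL;

-- INNER JOIN keeps only rows whose state is found in both tables
SELECT name_freq.*, region.region
FROM name_freq INNER JOIN region
     ON name_freq.state = region.state
LIMIT 20;
```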

If we want to keep the region information around to query, instead of redoing the join each time, we can save it as a view and access it whenever we want:
```sql
CREATE VIEW name_freq_region AS (
    SELECT name_freq.*, region.region FROM 
      name_freq LEFT JOIN region
      ON name_freq.state = region.state
      -- no limit!
    );
```

You can list the available views by using the command `\dv`.
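
Once the view exists, it can be queried like any table. A sketch of the kind of question that motivated the join, here for girls' names in New England (the label `New_England` is one of the region labels used in this lesson; substitute whichever label you are interested in):

```sql
-- Five most common female names in the New_England region, all years combined
SELECT name, sum(freq) AS num_people
FROM name_freq_region
WHERE region='New_England' AND gender='F'
GROUP BY name
ORDER BY num_people DESC
LIMIT 5;
```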

In [None]:
SELECT name_freq.*, region.region FROM 
        name_freq LEFT JOIN region      
        ON name_freq.state = region.state   
        LIMIT 20; 

In [None]:
CREATE VIEW name_freq_region AS (
    SELECT name_freq.*, region.region FROM 
      name_freq LEFT JOIN region
      ON name_freq.state = region.state
      -- no limit: the view should cover every row of name_freq
    );

In [1]:
-- The result of a JOIN is a view: it is in tabular form, like a table.

-- You cannot insert, update, or delete rows on this view. It is NOT a table!
-- Anything you change in the original tables changes what the view returns.
-- (The view updates automatically when the underlying tables change.)

## Exercise 11 (together, UPDATE)

Let's look at Rhode Island from the region table:
```sql
SELECT * FROM region WHERE state='RI';
-- returns
--  state |   region    
-- -------+-------------
--  RI    | New_England
```

Let's look at 5 rows for Rhode Island from the `name_freq_region` view:
```sql
SELECT * FROM name_freq_region WHERE state='RI' LIMIT 5;
-- returns 
--  state | gender | year |   name   | freq |   region    
-- -------+--------+------+----------+------+-------------
--  RI    | F      | 1951 | Linda    |  385 | New_England
--  RI    | F      | 1951 | Patricia |  376 | New_England
--  RI    | F      | 1951 | Susan    |  285 | New_England
--  RI    | F      | 1951 | Deborah  |  270 | New_England
--  RI    | F      | 1951 | Kathleen |  258 | New_England
```

Now let's change where Rhode Island is (we will need this for tomorrow):
```sql
UPDATE region SET region='Midwest' WHERE state='RI'; -- changes the region table
```

Now check the changes:
```sql
SELECT * FROM region WHERE state='RI';
-- returns
--  state |   region    
-- -------+-------------
--  RI    | Midwest
```
which was expected. 

What should be returned by this?
```sql
SELECT * FROM name_freq_region WHERE state='RI' LIMIT 5;
```


In [None]:
-- Note: don't 'correct' Rhode Island back to New_England. We will do that tomorrow!

## Exercise 12 (you do)

* List the different distinct regions that exist in the region table. There should be one that looks like a typo
* Update the rows with the region that has the typo.


## Future work

Congratulations! You have the SQL fundamentals down.

Now try some of the exercises in [03_SQL_lab_questions.md](03_SQL_lab_questions.md)