# SQL Indexes and basic ETL

This week we'll take a look at working with indexes and how examining query plans can shed light on query performance and database design.  Then we'll switch gears and look at extracting values from transactional data in a variety of ways.

## Setup - install PostgreSQL (Optional)

We are going to use [PostgreSQL](https://www.postgresql.org) version 9.5 or later this time. If you are using an AWS EC2 instance based on our AMI, you can skip this section. If PostgreSQL is not installed, follow the [instructions](https://www.postgresql.org/download/linux/) to install it.

To connect to PostgreSQL, we need to make sure the [ipython-sql](https://github.com/catherinedevlin/ipython-sql) and [psycopg2](https://github.com/psycopg/psycopg2) libraries are installed.

In [None]:
!pip freeze | grep -E 'ipython-sql|psycopg2'

If you see something like this, you are all set:
```
ipython-sql==0.3.8
psycopg2==2.6.2
```

## Setup - bikeshare data, again

We'll download the same Bikeshare data you've worked with before, and we'll create some database tables and indexes more deliberately before using PostgreSQL.

First, use PostgreSQL's `dropdb` command to drop the database named week6, if it exists. This is necessary so that we can run this notebook repeatedly. If you get an error saying "database week6 does not exist", that is fine. However, if it complains that "There is 1 other session using the database", restart the kernel and try again.

In [None]:
!dropdb --help

In [None]:
!dropdb -U student week6

Now use PostgreSQL's `createdb` command to create the database named week6. 

In [None]:
!createdb -U student week6

In [None]:
%load_ext sql

Use sql magic to connect to the database we just created. The URL format is dialect+driver://username:password@host:port/database. Use `student` as the user name. Password is not required here.

In [None]:
%sql postgresql://student@/week6
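To make the URL format concrete, here is a minimal sketch that splits a connection URL into its parts using only Python's standard library. The helper name `describe_db_url`, the password `secret`, and the host `localhost:5432` are hypothetical and used only for illustration. Note how the short form used above leaves the host empty, in which case the driver typically falls back to its default local connection.

```python
from urllib.parse import urlsplit


def describe_db_url(url):
    """Split a dialect+driver://user:password@host:port/database URL into parts."""
    parts = urlsplit(url)
    # The scheme carries both the dialect and the optional driver.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,   # None: no driver was named in the URL
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,     # None when the host is left empty
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# The short form used in this notebook.
short = describe_db_url("postgresql://student@/week6")
# A fully spelled-out URL with hypothetical password, host, and port.
full = describe_db_url("postgresql+psycopg2://student:secret@localhost:5432/week6")

print(short)
print(full)
```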

In [None]:
!wget -O 2017-Q1-trips.zip https://s3.amazonaws.com/capitalbikeshare-data/2017-Q1-cabi-trips-history-data.zip

In [None]:
!unzip -o 2017-Q1-trips.zip

In [None]:
!mv 2017-Q1-Trips-History-Data.csv 2017q1.csv

In [None]:
!wc -l 2017q1.csv

In [None]:
!csvcut -n 2017q1.csv

### Two ways to create this table

First, what we did before, renaming the header line by hand to make it easier to read and type queries.

In [None]:
!echo "duration_ms,start_date,end_date,start_station_id,start_station,end_station_id,end_station,bike_number,member_type" > rides.csv

In [None]:
!tail -n +2 2017q1.csv >> rides.csv

Let's make sure we did that correctly. The numbers match and the header line looks good.

In [None]:
!wc -l rides.csv

In [None]:
!head -3 rides.csv | csvlook

We can use the `csvsql` command to load the CSV file into the database.

**WARNING**: The following cell may take a very long time to complete. You may want to skip the next two code cells. You can interrupt the Kernel to stop loading if you don't want to wait anymore.

In [None]:
!csvsql --db postgresql://student@/week6 --insert rides.csv

In [None]:
%%sql
SELECT COUNT(*) FROM rides;

That was very slow.  Let's look at something more direct, using PostgreSQL's support for CSV import.

First, we take a look at a sample of the data to determine its attributes' domains and ranges.

In [None]:
!head -n 10000 rides.csv | csvstat

Based on these values, I expect we can work with the following:

In [None]:
%%sql
DROP TABLE IF EXISTS rides;
CREATE TABLE rides (
    duration_ms INTEGER NOT NULL,
    start_date TIMESTAMP NOT NULL,
    end_date TIMESTAMP NOT NULL,
    start_station_id INTEGER NOT NULL,
    start_station VARCHAR(64) NOT NULL,
    end_station_id INTEGER NOT NULL,
    end_station VARCHAR(64) NOT NULL,
    bike_number CHAR(21) NOT NULL,
    member_type CHAR(10) NOT NULL
)
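The column widths above come from reading the `csvstat` summary. As a rough cross-check, the longest value per column can also be measured directly. The sketch below does this with Python's `csv` module on a tiny, hypothetical three-row sample (the rows are made up for illustration, and only four of the nine columns are shown); in practice you would point it at `rides.csv`, keeping in mind that a sample may miss the longest values in the full file.

```python
import csv
import io

# Hypothetical sample standing in for rides.csv (values invented for illustration).
SAMPLE = """duration_ms,start_station,bike_number,member_type
221834,3rd & Tingey St SE,W00869,Registered
1676854,Lincoln Memorial,W20529,Casual
556044,Henry Bacon Dr & Lincoln Memorial Circle NW,W20040,Registered
"""


def max_lengths(csv_text):
    """Return the length of the longest value seen in each column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    longest = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            longest[name] = max(longest[name], len(value))
    return longest


lengths = max_lengths(SAMPLE)
print(lengths)
```

Any declared width such as `VARCHAR(64)` should be at least as large as the corresponding measured length, with some headroom for rows outside the sample.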

Now we'll load the data directly using the `COPY` command.  Note that this **requires** an absolute path, because the file is read by the database server rather than by the notebook, so adjust it to your location:

In [None]:
!pwd

In [None]:
%%sql
COPY rides FROM '/home/ubuntu/lectures/week-06/2017q1.csv'
CSV
HEADER;

In [None]:
%%sql
SELECT COUNT(*) FROM rides;

By the way, you can extract a schema from a pgsql instance with the following query, which uses the INFORMATION_SCHEMA metadata database.

In [None]:
%%sql
SELECT column_name, data_type, character_maximum_length, is_nullable
FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'rides';

## Working with indexes

Let's find a query that will go a little slow, and see how pgsql plans to implement it.

For example, what are the popular station pairs that result in the longest average rides?

In [None]:
%%sql
SELECT start_station, end_station, member_type, 
       ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes,
       COUNT(*) AS count
FROM rides
GROUP BY start_station, end_station, member_type
ORDER BY minutes DESC
LIMIT 10;
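One detail in this query is easy to miss: `duration_ms` is an `INTEGER`, so `duration_ms / (1000 * 60)` is integer division, and each ride is truncated to whole minutes before `AVG` sees it. The sketch below mimics that behavior in Python with three hypothetical durations to show how much the truncation can shift the average.

```python
# Hypothetical ride durations in milliseconds (not taken from the data set).
durations_ms = [90_000, 150_000, 210_000]   # 1.5, 2.5 and 3.5 minutes

# Integer division per ride, as SQL does for INTEGER operands.
# (For non-negative values, Python's // truncates the same way.)
truncated = [d // 60_000 for d in durations_ms]
avg_truncated = sum(truncated) / len(truncated)

# Averaging first and dividing afterwards keeps the fractional minutes.
avg_exact = sum(durations_ms) / len(durations_ms) / 60_000

print(round(avg_truncated, 1), round(avg_exact, 1))
```

In SQL, writing `AVG(duration_ms) / 60000.0` is one way to avoid the truncation. The queries in this section keep the original expression, so their reported minutes are slightly low.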

That was a little slow, so indexes should help speed things up.  First, though, let's look at the counts.  We only want popular station pairs, so let's set a minimum count level.

In [None]:
%%sql
SELECT start_station, end_station, member_type,
       ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes,
       COUNT(*) AS count 
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

That's better.  But it's slow.  Let's see how pgsql goes about it using `EXPLAIN`:

In [None]:
%%sql
EXPLAIN
SELECT start_station, end_station, member_type,
       ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes,
       COUNT(*) AS count 
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

There's a lot to unpack in there.  Read it from the inside out to figure out what it's doing.

 * `Seq Scan on rides` - this is a table scan, which will be slow
 * `Sort Key: start_station, end_station, member_type` - we're performing a sort across all three attributes
 * `Filter: (count(*) > 90)` - there's our selection constraint
 * `Sort Key: (round(avg((duration_ms / 60000)), 1)) DESC` - look, another sort!  what's the difference between the two?
 * `Limit (cost=101201.41..101201.44 rows=10 width=63)` - can you guess what each element here means?
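As a reading aid for lines like the last one, here is a small sketch that pulls the estimates out of a plan line with a regular expression. The function name `parse_estimates` is hypothetical. In PostgreSQL's `EXPLAIN` output, the two `cost` numbers are the estimated startup cost (work done before the first row can be returned) and the estimated total cost, both in arbitrary planner units; `rows` is the estimated number of rows the node outputs, and `width` is the estimated average row size in bytes.

```python
import re

# The plan line quoted in the list above.
PLAN_LINE = "Limit (cost=101201.41..101201.44 rows=10 width=63)"

ESTIMATES = re.compile(
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)


def parse_estimates(line):
    """Extract the planner's estimates from one EXPLAIN plan line."""
    match = ESTIMATES.search(line)
    if match is None:
        raise ValueError("no cost estimates found in: " + line)
    return {
        "startup_cost": float(match["startup"]),  # cost before the first row
        "total_cost": float(match["total"]),      # cost to return all rows
        "rows": int(match["rows"]),               # estimated output rows
        "width": int(match["width"]),             # estimated bytes per row
    }


estimates = parse_estimates(PLAN_LINE)
print(estimates)
```

Here the startup and total costs are nearly equal, which suggests that almost all of the work happens in the nodes beneath `Limit` before the first row can be emitted.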
 
To speed things up, we need an index.  Let's start with one on `start_station`.

In [None]:
%%sql
DROP INDEX IF EXISTS idx_start_station;
CREATE INDEX idx_start_station ON rides (start_station);

Note that it takes a few seconds - it's building that indexing structure, then storing it on disk.  Remember the metrics we discussed for different index types?  This step invokes both the insert time and storage overhead metrics.

The key thing is whether the query will go faster, so let's check:

In [None]:
%%sql
EXPLAIN
SELECT start_station, end_station, member_type,
       ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes,
       COUNT(*) AS count 
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

It doesn't look like it will be any faster.  Why not?

We need to create a different kind of index.  That most-nested sort is being done on a combination of three attributes at once.  So let's create an index on all three.

In [None]:
%%sql
DROP INDEX IF EXISTS idx_stations_member_type;
CREATE INDEX idx_stations_member_type ON rides (start_station, end_station, member_type);
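To build some intuition for why the three-column index matters, here is a minimal sketch of the idea using hypothetical rows. It is not how PostgreSQL is implemented; it only illustrates that when rows arrive already ordered by the full grouping key, each group is one contiguous run and can be aggregated in a single pass, whereas ordering by `start_station` alone leaves the groups fragmented.

```python
from itertools import groupby

# Hypothetical rows: (start_station, end_station, member_type, duration_ms).
rides = [
    ("A", "B", "Casual", 600_000),
    ("A", "A", "Casual", 1_200_000),
    ("A", "B", "Casual", 900_000),
    ("B", "A", "Registered", 300_000),
    ("A", "A", "Casual", 1_800_000),
]


def group_key(row):
    """The three GROUP BY columns."""
    return row[:3]


# Ordering by start_station only (like the single-column index):
# rows of the same group are still scattered, so runs outnumber groups.
by_start_only = sorted(rides, key=lambda row: row[0])
runs_by_start_only = sum(1 for _ in groupby(by_start_only, key=group_key))

# Ordering by all three columns (like the composite index):
# every group is one contiguous run, so one pass aggregates everything.
in_index_order = sorted(rides, key=group_key)
summary = {}
for key, rows in groupby(in_index_order, key=group_key):
    durations = [row[3] for row in rows]
    minutes = sum(durations) / len(durations) / 60_000
    summary[key] = (len(durations), minutes)

print(runs_by_start_only, len(summary))
print(summary)
```

With only five rows, the single-column ordering already produces five runs for three actual groups, which is why a further sort on the full key would still be needed.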

In [None]:
%%sql
EXPLAIN
SELECT start_station, end_station, member_type,
       ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes,
       COUNT(*) AS count 
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

Now we're getting somewhere!  Look at that last line:

* `Index Scan using idx_stations_member_type` - this means the optimizer found our index and will use it

And the result is *speed*:

In [None]:
%%sql
SELECT start_station, end_station, member_type,
       ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes,
       COUNT(*) AS count 
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

Much better, right?

Notice that those rides start and end at the same stations. And they are all casual riders.

Next, because there are clearly a lot of tourists circling the National Mall, let's look at regular riders.

In [None]:
%%sql
SELECT start_station, end_station, member_type,
       ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes,
       COUNT(*) AS count 
FROM rides
WHERE member_type = 'Registered'
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

Notice that some of them no longer start and end at the same stations. And the rides are shorter for registered riders.

What are the top 10 most popular departing stations in Q1 2017?

In [None]:
%%sql
SELECT start_station, 
       ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, 
       COUNT(*) AS count
FROM rides
GROUP BY start_station
ORDER BY COUNT(*) DESC
LIMIT 10;

What are the top 10 most popular destination stations in Q1 2017?

In [None]:
%%sql
SELECT end_station, 
       ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, 
       COUNT(*) AS count
FROM rides
GROUP BY end_station
ORDER BY COUNT(*) DESC
LIMIT 10;

Let's do the same, but for bikes. Which 10 bikes were used most in trips departing from the most popular departure station?

In [None]:
%%sql
SELECT bike_number, COUNT(*) AS count
FROM rides
WHERE start_station = 'Columbus Circle / Union Station'
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 10;

Let's try this again all in one with a subquery.  First we make sure we get the nested subquery part right.

In [None]:
%%sql
SELECT start_station
FROM rides
GROUP BY start_station
ORDER BY COUNT(*) DESC
LIMIT 1;

Looks good.  Now let's insert the nested subquery into the other.

In [None]:
%%sql
SELECT bike_number, COUNT(*) AS count
FROM rides
WHERE start_station IN
    (SELECT start_station
     FROM rides
     GROUP BY start_station
     ORDER BY COUNT(*) DESC
     LIMIT 1)
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 10;
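The same two-step logic can be sketched in plain Python, which may help when checking that the nested query does what we intend. The rows below are hypothetical; the inner step finds the most popular departure station, and the outer step counts bikes only among rides leaving from it.

```python
from collections import Counter

# Hypothetical (start_station, bike_number) pairs.
rides = [
    ("Union Station", "W001"),
    ("Union Station", "W002"),
    ("Union Station", "W001"),
    ("Lincoln Memorial", "W003"),
    ("Union Station", "W001"),
    ("Lincoln Memorial", "W002"),
]

# Inner query: the single most popular departure station.
station_counts = Counter(station for station, _ in rides)
top_station = station_counts.most_common(1)[0][0]

# Outer query: count bikes among rides leaving that station, top 10 only.
bike_counts = Counter(bike for station, bike in rides if station == top_station)
top_bikes = bike_counts.most_common(10)

print(top_station, top_bikes)
```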

Review the query plan from `EXPLAIN` on that one.  See anything familiar in there?

In [None]:
%%sql
EXPLAIN
SELECT bike_number, COUNT(*) AS count
FROM rides
WHERE start_station IN
    (SELECT start_station
     FROM rides
     GROUP BY start_station
     ORDER BY COUNT(*) DESC
     LIMIT 1)
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 10;

## Basic ETL with SQL

Today we'll look at examples of how to extract consistent sets of values out of your database.  ETL as a whole consists of a lot more than just this, but because every environment has its own tools and approach, we'll just be getting a taste of it here.

First let's look at extracting simple details like station names.

In [None]:
%%sql
SELECT DISTINCT start_station
FROM rides
ORDER BY start_station
LIMIT 10;

In [None]:
%%sql
SELECT DISTINCT end_station
FROM rides
ORDER BY end_station
LIMIT 10;

To be sure we get them all, we need to combine them into a union set.

In [None]:
%%sql
SELECT DISTINCT start_station AS station FROM rides
UNION
SELECT DISTINCT end_station AS station FROM rides;

Now we can create a new table to house the unique station names.

In [None]:
%%sql
DROP TABLE IF EXISTS stations;
CREATE TABLE stations (
    id SERIAL,
    name VARCHAR(64)
);

In [None]:
%%sql
INSERT INTO stations (name)
SELECT DISTINCT start_station AS station FROM rides
UNION
SELECT DISTINCT end_station AS station FROM rides;

In [None]:
%%sql
SELECT * FROM stations LIMIT 10;
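The extraction above amounts to a set union followed by assigning each distinct name a surrogate key. Here is a minimal sketch of that idea with hypothetical station names. The names are sorted before numbering only to make the output reproducible; the SQL version makes no promise about the order in which `SERIAL` ids are handed out.

```python
# Hypothetical station names seen as ride origins and destinations.
start_stations = {"Lincoln Memorial", "Union Station", "Jefferson Dr"}
end_stations = {"Union Station", "Rosslyn Metro"}

# UNION: every name that appears on either side, without duplicates.
all_stations = start_stations | end_stations

# Assign a sequential id to each distinct name, similar in spirit to SERIAL.
stations = [
    {"id": station_id, "name": name}
    for station_id, name in enumerate(sorted(all_stations), start=1)
]

print(stations)
```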

We can also record the minutes as a new column so we don't have to calculate from milliseconds every time.

In [None]:
%%sql
ALTER TABLE rides
ADD COLUMN duration_min NUMERIC;

The following update could take up to a minute to complete.

In [None]:
%%sql
UPDATE rides
SET duration_min = ROUND(CAST(duration_ms AS NUMERIC) / (1000 * 60), 1);

In [None]:
%%sql
SELECT duration_ms, duration_min FROM rides
LIMIT 5;

Another valuable pattern is to use date functions to extract particular time intervals, such as months or days.  Unfortunately, every RDBMS has its own set of date functions, so you will likely just have to learn the ones used by the system in your environment.

Read more in the [documentation for PostgreSQL date formatting](https://www.postgresql.org/docs/9.5/static/functions-formatting.html).

In [None]:
%%sql
SELECT start_date,
       EXTRACT(DAY FROM start_date)::integer AS day, 
       EXTRACT(MONTH FROM start_date)::integer AS month, 
       EXTRACT(YEAR FROM start_date)::integer AS year
FROM rides
LIMIT 10;

In data warehouse models and in statistical model feature engineering, it can be particularly useful to extract all kinds of parts of dates out into variables.  You never know where you'll find significance.

This kind of extraction is quite common.

In [None]:
%%sql
SELECT TO_CHAR(start_date, 'YYYY-MM-DD') AS day, 
    TO_CHAR(start_date, 'YYYY') AS year,
    TO_CHAR(start_date, 'MM') AS month,
    TO_CHAR(start_date, 'DD') AS day_of_month,
    TO_CHAR(start_date, 'Day') AS day_of_week_str,
    TO_CHAR(start_date, 'D') AS day_of_week,
    -- 'D' numbers days Sunday (1) through Saturday (7), so the weekend is 1 or 7.
    CASE WHEN CAST(TO_CHAR(start_date, 'D') AS INTEGER) IN (1, 7)
        THEN 1 
        ELSE 0
    END AS is_weekend,
    CASE WHEN CAST(TO_CHAR(start_date, 'D') AS INTEGER) BETWEEN 2 AND 6
        THEN 1 
        ELSE 0
    END AS is_weekday,
    TO_CHAR(start_date, 'HH24') AS hour_of_day,
    TO_CHAR(start_date, 'Q') AS quarter
FROM rides
LIMIT 10;
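For comparison, here is a sketch of the same kind of date-part extraction in Python, which can be handy for spot-checking a few rows. The function name `date_parts` is hypothetical. One thing to watch is day-of-week numbering: PostgreSQL's `D` pattern counts Sunday as 1 through Saturday as 7, so a value of 6 is a Friday, while Python's `isoweekday()` counts Monday as 1 through Sunday as 7. The sketch converts to the PostgreSQL numbering.

```python
from datetime import datetime


def date_parts(ts):
    """Break a timestamp into warehouse-style date attributes."""
    # Convert ISO numbering (Mon=1..Sun=7) to PostgreSQL 'D' (Sun=1..Sat=7).
    pg_dow = ts.isoweekday() % 7 + 1
    is_weekend = int(pg_dow in (1, 7))
    return {
        "day": ts.strftime("%Y-%m-%d"),
        "year": ts.year,
        "month": ts.month,
        "day_of_month": ts.day,
        "day_of_week": pg_dow,
        "is_weekend": is_weekend,
        "is_weekday": 1 - is_weekend,
        "hour_of_day": ts.hour,
        "quarter": (ts.month - 1) // 3 + 1,
    }


sunday = date_parts(datetime(2017, 1, 1, 9, 30))    # 1 January 2017 was a Sunday
friday = date_parts(datetime(2017, 3, 31, 18, 5))   # 31 March 2017 was a Friday

print(sunday)
print(friday)
```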

## GROUPING SETS, ROLLUP, CUBE

This is the regular grouping that displays the top 10 station pairs.

In [None]:
%%sql
SELECT start_station, end_station, COUNT(*) AS count
FROM rides
GROUP BY start_station, end_station
ORDER BY COUNT(*) DESC
LIMIT 10;

`GROUP BY GROUPING SETS ((start_station, end_station), (member_type), ())` generates a list with the top station pairs as well as the total counts for each member type and for the whole table.

In [None]:
%%sql
SELECT start_station, end_station, member_type, COUNT(*) AS count
FROM rides
GROUP BY GROUPING SETS ((start_station, end_station), (member_type), ())
ORDER BY COUNT(*) DESC
LIMIT 10;
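To see what a grouping-sets result looks like, here is a minimal sketch that counts hypothetical rides once per grouping set. The function name `grouping_sets_count` is made up for illustration. Columns that are not part of a given grouping set are filled with `None`, mirroring the `NULL` values that appear in those columns of the SQL output.

```python
from collections import Counter

COLUMNS = ("start_station", "end_station", "member_type")

# Hypothetical rides: (start_station, end_station, member_type).
rides = [
    ("A", "B", "Casual"),
    ("A", "B", "Registered"),
    ("B", "A", "Registered"),
]


def grouping_sets_count(rows, grouping_sets):
    """Count rows once per grouping set, with None for omitted columns."""
    counts = Counter()
    for row in rows:
        record = dict(zip(COLUMNS, row))
        for grouping_set in grouping_sets:
            key = tuple(
                record[column] if column in grouping_set else None
                for column in COLUMNS
            )
            counts[key] += 1
    return counts


result = grouping_sets_count(
    rides,
    [("start_station", "end_station"), ("member_type",), ()],
)

for key, count in result.most_common():
    print(key, count)
```

The row with all three columns empty is the grand total, which is why it sorts to the top when ordering by count.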

`ROLLUP (start_station, end_station, member_type)` generates a similar set of increasingly aggregated summaries, lopping off one column from the right at a time. It is equivalent to `GROUP BY GROUPING SETS ((start_station, end_station, member_type), (start_station, end_station), (start_station), ())`.

In [None]:
%%sql
SELECT start_station, end_station, member_type, COUNT(*) AS count
FROM rides
GROUP BY ROLLUP (start_station, end_station, member_type)
HAVING COUNT(*) > 300;

`CUBE (start_station, end_station, member_type)` generates summaries for the full set of attributes and every possible subset of it. It is equivalent to `GROUP BY GROUPING SETS ((start_station, end_station, member_type), (start_station, end_station), (start_station, member_type), (end_station, member_type), (start_station), (end_station), (member_type), ())`.

In [None]:
%%sql
SELECT start_station, end_station, member_type, COUNT(*) AS c
FROM rides
GROUP BY CUBE (start_station, end_station, member_type)
HAVING COUNT(*) > 300;
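The equivalences described for `ROLLUP` and `CUBE` can be generated mechanically. The sketch below, using hypothetical helper names `rollup` and `cube`, expands a column list into the corresponding grouping sets: prefixes of decreasing length for `ROLLUP`, and every subset for `CUBE`.

```python
from itertools import combinations

COLUMNS = ("start_station", "end_station", "member_type")


def rollup(columns):
    """Prefixes of the column list, longest first, ending with the empty set."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube(columns):
    """Every subset of the column list, largest first."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


rollup_sets = rollup(COLUMNS)
cube_sets = cube(COLUMNS)

print(len(rollup_sets), rollup_sets)
print(len(cube_sets), cube_sets)
```

For n columns, `ROLLUP` yields n + 1 grouping sets while `CUBE` yields 2^n, so `CUBE` output grows quickly as columns are added.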