# More basic ETL

This week we'll take a look at more ETL functions, building up a mini warehouse using Bikeshare and weather data.

## Setup - install PostgreSQL (Optional)

We are going to use [PostgreSQL](https://www.postgresql.org) this time. If you are using datanotebook.org or AWS EC2 instances based on our AMI, you can skip this section. If PostgreSQL is not installed, follow the [instructions](https://www.postgresql.org/download/linux/) to install it.

If the PostgreSQL server is not running, execute the following cell to start it:

In [None]:
!sudo /etc/init.d/postgresql start

To connect to PostgreSQL, we need to make sure the [ipython-sql](https://github.com/catherinedevlin/ipython-sql) and [psycopg2](https://github.com/psycopg/psycopg2) libraries are installed.

In [None]:
!pip install ipython-sql psycopg2

## Setup - bikeshare data, again

We'll download the same Bikeshare data you've worked with before, and we'll create some database tables and indexes more deliberately using PostgreSQL.

In [None]:
%load_ext sql

In [None]:
!createdb week7

In [None]:
%sql postgresql:///week7

In [None]:
!wget https://s3.amazonaws.com/capitalbikeshare-data/2017-Q1-cabi-trips-history-data.zip

In [None]:
!unzip -o 2017-Q1-cabi-trips-history-data.zip

In [None]:
!mv 2017-Q1-Trips-History-Data.csv 2017q1.csv

In [None]:
!wc -l 2017q1.csv

In [None]:
!csvcut -n 2017q1.csv

### Create table and import

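Given the volume of data here, let's go straight to PostgreSQL to load the data. First, though, we'll sample 10,000 random lines to get a feel for the types and sizes of the columns.

*Note:* use `gshuf` if you're on a Mac; otherwise try `shuf`. The same options should work for both.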

In [None]:
!shuf -n 10000 2017q1.csv | csvstat

Based on these values, I expect we can work with the following:

In [None]:
%%sql
DROP TABLE IF EXISTS rides;
CREATE TABLE rides (
    duration_ms INTEGER,
    start_date TIMESTAMP,
    end_date TIMESTAMP,
    start_station_id INTEGER,
    start_station VARCHAR(64),
    end_station_id INTEGER,
    end_station VARCHAR(64),
    bike_number CHAR(21),
    member_type CHAR(10)
)
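One way to sanity-check the `VARCHAR` and `CHAR` widths chosen above is to measure the longest value in each column. Here is a minimal sketch using only the standard library. The inline sample rows and column names are made up for illustration, and `max_widths` is a hypothetical helper, not part of csvkit.

```python
import csv
import io

# Hypothetical sample; column names and values are assumptions, not the real file.
SAMPLE = """Duration,Start station,Member Type
221834,3rd & Tingey St SE,Registered
1676854,Lincoln Memorial,Casual
"""

def max_widths(csv_file):
    """Return {column name: length of the longest value seen}."""
    reader = csv.DictReader(csv_file)
    widths = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            widths[name] = max(widths[name], len(value))
    return widths

widths = max_widths(io.StringIO(SAMPLE))
print(widths)
```

To try it on the real file, pass `open('2017q1.csv', newline='')` instead of the in-memory sample.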

Now we'll load the data in more simply.  Note that this **requires** the use of an absolute path, so adjust it to your location:

In [None]:
!pwd

In [None]:
%%sql
COPY rides FROM '/home/jovyan/work/gwu/2017/lectures/star-week-01/2017q1.csv'
CSV
HEADER
QUOTE '"'
DELIMITER ',';
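The absolute path matters because `COPY ... FROM` a file is read by the database server, not by the notebook. Instead of pasting the output of `pwd` by hand, a small sketch like the following could build the statement. `copy_statement` is a hypothetical helper, and it assumes the path contains no single quotes.

```python
from pathlib import Path

def copy_statement(table, csv_name):
    """Build a COPY statement that uses the absolute path of csv_name."""
    # resolve() turns a relative name into an absolute path based on the current directory
    path = Path(csv_name).resolve()
    return (
        f"COPY {table} FROM '{path}' "
        "CSV HEADER QUOTE '\"' DELIMITER ',';"
    )

stmt = copy_statement("rides", "2017q1.csv")
print(stmt)
```

The printed statement can then be pasted into a `%%sql` cell.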

In [None]:
%%sql
SELECT COUNT(*) FROM rides;

In [None]:
%%sql
SELECT * FROM rides
LIMIT 10

## More ETL with SQL

Today we'll extend last week's examples of how to extract consistent sets of values out of your database.  

First let's pick up where we left off, extracting simple details like station names.

In [None]:
%%sql
SELECT DISTINCT start_station, start_station_id
FROM rides
ORDER BY start_station
LIMIT 10;

In [None]:
%%sql
SELECT DISTINCT end_station, end_station_id
FROM rides
ORDER BY end_station
LIMIT 10;

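Some stations might show up only as a start station or only as an end station. To be sure we get them all, we need to combine both lists with a `UNION`, which also removes duplicates.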

In [None]:
%%sql
SELECT DISTINCT start_station AS station, start_station_id AS station_id FROM rides
UNION
SELECT DISTINCT end_station AS station, end_station_id AS station_id FROM rides

Now we can create a new table to house the unique station names.

In [None]:
%%sql
DROP TABLE IF EXISTS stations;
CREATE TABLE stations (
    id SERIAL PRIMARY KEY,
    name VARCHAR(64),
    station_key INTEGER
);

In [None]:
%%sql
INSERT INTO stations (name, station_key)
SELECT DISTINCT start_station AS station, start_station_id AS station_key FROM rides
UNION
SELECT DISTINCT end_station AS station, end_station_id AS station_key FROM rides;

In [None]:
%%sql
SELECT * FROM stations LIMIT 10;

We can even add these new identifiers back to the original table now.

In [None]:
%%sql
ALTER TABLE rides 
ADD COLUMN start_station_nid INTEGER,
ADD CONSTRAINT fk_start_station
FOREIGN KEY (start_station_nid)
REFERENCES stations (id);

In [None]:
%%sql
UPDATE rides AS r
SET start_station_nid = s.id
FROM stations AS s
WHERE r.start_station = s.name;

In [None]:
%%sql
ALTER TABLE rides 
ADD COLUMN end_station_nid INTEGER,
ADD CONSTRAINT fk_end_station
FOREIGN KEY (end_station_nid)
REFERENCES stations (id);

In [None]:
%%sql
UPDATE rides AS r
SET end_station_nid = s.id
FROM stations AS s
WHERE r.end_station = s.name;
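Conceptually, these updates are a lookup from station name to the new surrogate id. A rough Python sketch of the same mapping, with made-up station names and rides standing in for the real tables:

```python
# Hypothetical station names standing in for the rows of the stations table.
station_names = ["Lincoln Memorial", "Union Market"]

# Assign surrogate ids starting at 1, in the spirit of a SERIAL column.
station_ids = {name: i for i, name in enumerate(sorted(station_names), start=1)}

# Hypothetical rides, each with a start and end station name.
rides = [
    {"start_station": "Union Market", "end_station": "Lincoln Memorial"},
    {"start_station": "Lincoln Memorial", "end_station": "Lincoln Memorial"},
]

# The equivalent of the two UPDATE ... FROM statements: look up each name.
for ride in rides:
    ride["start_station_nid"] = station_ids.get(ride["start_station"])
    ride["end_station_nid"] = station_ids.get(ride["end_station"])

print(rides)
```

Using `.get()` means a name with no match ends up as `None`, much like a row the `UPDATE` never touches stays `NULL`.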

In [None]:
%%sql
SELECT * FROM rides
LIMIT 5;

### Simple address geocoding

It feels like we should do a little more with the stations, doesn't it?  Let's see if we can geocode them using the [geocoder library](https://geocoder.readthedocs.io/).

In [None]:
!pip install geocoder

In [None]:
import geocoder

### Connecting to the db from python

Here we'll use a little Python to run geocoding queries, update the `stations` table, and flesh out the data a bit more.

In [None]:
%%sql
ALTER TABLE stations
ADD COLUMN lat NUMERIC DEFAULT 0,
ADD COLUMN lng NUMERIC DEFAULT 0;

*Note* the specific user and host names are required at datanotebook.org.  If you're running this locally, you'll need to adjust your username/dbname/etc. as appropriate.

In [None]:
import psycopg2

conn = psycopg2.connect("dbname='week7'")
c = conn.cursor()

In [None]:
c.execute("SELECT id, name FROM stations ORDER BY id ASC")
rows = c.fetchall()
for r in rows:
    station_id, station_name = r
    print('%s: %s' % (station_id, station_name))
    g = geocoder.google('%s Washington DC' % station_name)
    c.execute("UPDATE stations SET lat = (%s), lng = (%s) WHERE id = (%s)", 
              (g.lat, g.lng, station_id))
conn.commit()
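Geocoding services don't always find a match, so it can help to separate the lookups from the database updates and keep track of failures. Here is a minimal sketch of that idea. `geocode_stations` and `fake_geocode` are hypothetical names, the coordinates are illustrative, and the stand-in geocoder is used so the example runs without a network call.

```python
def geocode_stations(rows, geocode):
    """Return (updates, failures) for a list of (station_id, name) rows.

    `geocode` is any callable that takes an address string and returns a
    (lat, lng) tuple, or None when the lookup fails.
    """
    updates, failures = [], []
    for station_id, name in rows:
        result = geocode('%s Washington DC' % name)
        if result is None:
            failures.append(station_id)
        else:
            lat, lng = result
            updates.append((lat, lng, station_id))
    return updates, failures

# Stand-in geocoder with illustrative coordinates, used instead of a real service.
def fake_geocode(address):
    known = {'Lincoln Memorial Washington DC': (38.8893, -77.0502)}
    return known.get(address)

updates, failures = geocode_stations(
    [(1, 'Lincoln Memorial'), (2, 'No Such Station')], fake_geocode)
print(updates, failures)
```

With a real connection, the `updates` list could be handed to `c.executemany()` with the same `UPDATE` statement as above, and `failures` tells you which stations to retry.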

In [None]:
%%sql
SELECT AVG(lat), MIN(lat), MAX(lat), STDDEV(lat) FROM stations;

In [None]:
%%sql
SELECT * FROM stations LIMIT 10;

In [None]:
%%sql
SELECT COUNT(*) FROM stations WHERE lat IS NULL OR lng IS NULL;

Looks like it mostly worked.  It's a start, at least.

### Saving a transformation with every query

Another useful step might be recording the minutes as a new column so we don't have to calculate from milliseconds every time.

In [None]:
%%sql
ALTER TABLE rides
ADD COLUMN duration_min NUMERIC;

In [None]:
%%sql
UPDATE rides
SET duration_min = ROUND(CAST(duration_ms AS NUMERIC) / (1000 * 60), 1);

In [None]:
%%sql
SELECT duration_ms, duration_min FROM rides
LIMIT 5;
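To double-check a few values by hand, the same arithmetic can be sketched in Python. `ms_to_minutes` is a hypothetical helper. It uses `Decimal` with `ROUND_HALF_UP` on the assumption that this matches how `ROUND` treats ties on `NUMERIC` values.

```python
from decimal import Decimal, ROUND_HALF_UP

def ms_to_minutes(duration_ms):
    """Convert milliseconds to minutes, rounded to one decimal place."""
    minutes = Decimal(duration_ms) / Decimal(1000 * 60)
    # ROUND_HALF_UP rounds ties away from zero for the non-negative durations here.
    return minutes.quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)

print(ms_to_minutes(221834))   # 221834 / 60000 is about 3.697
print(ms_to_minutes(90000))    # exactly 1.5 minutes
```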

Another valuable pattern is to use date functions to extract particular time intervals, such as months or days.  Unfortunately, every RDBMS has its own set of date functions, so you will likely just have to learn the ones used by the system in your environment.

Read more in the [documentation for PostgreSQL date formatting](https://www.postgresql.org/docs/9.5/static/functions-formatting.html).

In [None]:
%%sql
SELECT EXTRACT(DAY FROM start_date)::integer AS day, 
       EXTRACT(MONTH FROM start_date)::integer AS month, 
       EXTRACT(YEAR FROM start_date)::integer AS year
FROM rides
LIMIT 10;

In data warehouse models and in statistical model feature engineering, it can be particularly useful to extract all kinds of parts of dates out into variables.  You never know where you'll find significance.

This kind of extraction is quite common.

In [None]:
%%sql
SELECT DISTINCT TO_CHAR(start_date, 'YYYY-MM-DD HH24:00:00') AS hour,
    TO_CHAR(start_date, 'YYYY-MM-DD') AS day, 
    TO_CHAR(start_date, 'YYYY') AS year,
    TO_CHAR(start_date, 'MM') AS month,
    TO_CHAR(start_date, 'DD') AS day_of_month,
    TO_CHAR(start_date, 'Day') AS day_of_week_str,
    TO_CHAR(start_date, 'D') AS day_of_week,
    CASE WHEN CAST(TO_CHAR(start_date, 'D') AS INTEGER) IN (1, 7) 
        THEN 1 
        ELSE 0
    END AS is_weekend,
    CASE WHEN CAST(TO_CHAR(start_date, 'D') AS INTEGER) NOT IN (1, 7) 
        THEN 1 
        ELSE 0
    END AS is_weekday,
    TO_CHAR(start_date, 'HH24') AS hour_of_day,
    TO_CHAR(start_date, 'Q') AS quarter
FROM rides
LIMIT 10;

In [None]:
%%sql
DROP TABLE IF EXISTS hours;
CREATE TABLE hours (
    id SERIAL PRIMARY KEY,
    hour CHAR(19),
    day CHAR(10),
    year INTEGER,
    month INTEGER,
    day_of_month INTEGER,
    day_of_week_str CHAR(9),
    day_of_week INTEGER,
    is_weekend BOOLEAN,
    is_weekday BOOLEAN,
    hour_of_day INTEGER,
    quarter INTEGER
);

In [None]:
%%sql
INSERT INTO hours (hour, day, year, month, day_of_month, day_of_week_str, day_of_week,
                  is_weekend, is_weekday, hour_of_day, quarter)
SELECT DISTINCT TO_CHAR(start_date, 'YYYY-MM-DD HH24:00:00') AS hour,
    TO_CHAR(start_date, 'YYYY-MM-DD') AS day, 
    CAST(TO_CHAR(start_date, 'YYYY') AS INTEGER) AS year,
    CAST(TO_CHAR(start_date, 'MM') AS INTEGER) AS month,
    CAST(TO_CHAR(start_date, 'DD') AS INTEGER) AS day_of_month,
    TO_CHAR(start_date, 'Day') AS day_of_week_str,
    CAST(TO_CHAR(start_date, 'D') AS INTEGER) AS day_of_week,
    CASE WHEN CAST(TO_CHAR(start_date, 'D') AS INTEGER) IN (1, 7) 
        THEN TRUE
        ELSE FALSE
    END AS is_weekend,
    CASE WHEN CAST(TO_CHAR(start_date, 'D') AS INTEGER) NOT IN (1, 7) 
        THEN TRUE
        ELSE FALSE
    END AS is_weekday,
    CAST(TO_CHAR(start_date, 'HH24') AS INTEGER) AS hour_of_day,
    CAST(TO_CHAR(start_date, 'Q') AS INTEGER) AS quarter
FROM rides
UNION
SELECT DISTINCT TO_CHAR(end_date, 'YYYY-MM-DD HH24:00:00') AS hour,
    TO_CHAR(end_date, 'YYYY-MM-DD') AS day, 
    CAST(TO_CHAR(end_date, 'YYYY') AS INTEGER) AS year,
    CAST(TO_CHAR(end_date, 'MM') AS INTEGER) AS month,
    CAST(TO_CHAR(end_date, 'DD') AS INTEGER) AS day_of_month,
    TO_CHAR(end_date, 'Day') AS day_of_week_str,
    CAST(TO_CHAR(end_date, 'D') AS INTEGER) AS day_of_week,
    CASE WHEN CAST(TO_CHAR(end_date, 'D') AS INTEGER) IN (1, 7) 
        THEN TRUE
        ELSE FALSE
    END AS is_weekend,
    CASE WHEN CAST(TO_CHAR(end_date, 'D') AS INTEGER) NOT IN (1, 7) 
        THEN TRUE
        ELSE FALSE
    END AS is_weekday,
    CAST(TO_CHAR(end_date, 'HH24') AS INTEGER) AS hour_of_day,
    CAST(TO_CHAR(end_date, 'Q') AS INTEGER) AS quarter
FROM rides;

In [None]:
%%sql
SELECT * FROM hours
LIMIT 10;

And let's make sure we got that weekend bit right:

In [None]:
%%sql
SELECT DISTINCT day_of_week_str, day_of_week, is_weekend, is_weekday FROM hours;
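The weekend check hinges on how `TO_CHAR(..., 'D')` numbers the days: Sunday is 1 and Saturday is 7. Python's `isoweekday()` counts from Monday instead, so a rough equivalent needs a small shift. The sketch below mirrors most of the columns of the `hours` table. `date_parts` is a hypothetical helper, not something the notebook depends on.

```python
from datetime import datetime

def date_parts(ts):
    """Return a dict of date parts similar to the columns of the hours table."""
    # TO_CHAR(..., 'D') numbers days Sunday=1 .. Saturday=7;
    # isoweekday() numbers them Monday=1 .. Sunday=7, so shift by one.
    day_of_week = ts.isoweekday() % 7 + 1
    return {
        'hour': ts.strftime('%Y-%m-%d %H:00:00'),
        'day': ts.strftime('%Y-%m-%d'),
        'year': ts.year,
        'month': ts.month,
        'day_of_month': ts.day,
        'day_of_week_str': ts.strftime('%A'),
        'day_of_week': day_of_week,
        'is_weekend': day_of_week in (1, 7),
        'hour_of_day': ts.hour,
        'quarter': (ts.month - 1) // 3 + 1,
    }

parts = date_parts(datetime(2017, 1, 1, 14, 35))  # a Sunday
print(parts)
```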

In [None]:
%%sql
ALTER TABLE rides 
ADD COLUMN start_hour_id INTEGER,
ADD CONSTRAINT fk_start_hour
FOREIGN KEY (start_hour_id)
REFERENCES hours (id);

In [None]:
%%sql
UPDATE rides AS r
SET start_hour_id = h.id
FROM hours AS h
WHERE TO_CHAR(r.start_date, 'YYYY-MM-DD HH24:00:00') = h.hour;

In [None]:
%%sql
ALTER TABLE rides 
ADD COLUMN end_hour_id INTEGER,
ADD CONSTRAINT fk_end_hour
FOREIGN KEY (end_hour_id)
REFERENCES hours (id);

In [None]:
%%sql
UPDATE rides AS r
SET end_hour_id = h.id
FROM hours AS h
WHERE TO_CHAR(r.end_date, 'YYYY-MM-DD HH24:00:00') = h.hour;

In [None]:
%%sql
SELECT rides.start_date, rides.end_date, s_hours.hour AS start_hour, e_hours.hour AS end_hour
FROM rides
JOIN hours AS s_hours
  ON s_hours.id = rides.start_hour_id
JOIN hours AS e_hours
  ON e_hours.id = rides.end_hour_id
LIMIT 10;

In [None]:
%%sql
SELECT day_of_week_str, COUNT(*) count
FROM rides, hours
WHERE rides.start_hour_id = hours.id
GROUP BY day_of_week_str
ORDER BY count DESC;

## Adding weather data

An interesting dimension to the bikeshare history is weather - I know I don't like to ride in the rain.  I'm probably not the only one.

Weather Underground offers access to weather history data at links like https://www.wunderground.com/history/airport/KDCA/2017/1/18/DailyHistory.html?req_city=Washington&req_state=DC&req_statename=District+of+Columbia&reqdb.zip=20003&reqdb.magic=1&reqdb.wmo=99999.  

You can also download data in CSV format. For example: https://www.wunderground.com/history/airport/KDCA/2017/1/18/DailyHistory.html?req_city=Washington&req_state=DC&req_statename=District+of+Columbia&reqdb.zip=20003&reqdb.magic=1&reqdb.wmo=99999&format=1

Mmm, CSV.  We know what to do with CSV.

In [None]:
from string import Template
import requests

In [None]:
url_template = Template('https://www.wunderground.com/history/airport/KDCA/$year/$month/$day/DailyHistory.html?req_city=Washington&req_state=DC&req_statename=District+of+Columbia&reqdb.zip=20003&reqdb.magic=1&reqdb.wmo=99999&format=1')
print(url_template.substitute(year=2017, month=1, day=18))

Let's write some simple Python code to download the weather data for the first quarter of 2017.

In [None]:
import calendar
year = 2017
for month in range(1, 4):
    days = calendar.monthrange(year, month)[1]
    for day in range(1, days+1):
        r = requests.get(url_template.substitute(year=year, month=month, day=day))
        filename = 'weather-%04d%02d%02d.csv' % (year, month, day)
        print('Saving %s' % filename)
        # Use a context manager so each file is closed after writing
        with open(filename, 'wb') as f:
            f.write(r.content)
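An alternative to nesting month and day loops is to walk the calendar one day at a time with `datetime.date`. The sketch below only builds the list of URLs and file names, without downloading anything. `demo_template` is a shortened, hypothetical stand-in for the real URL template, and `daterange` is a hypothetical helper.

```python
from datetime import date, timedelta
from string import Template

# Shortened, hypothetical template; the real URL above has more query parameters.
demo_template = Template('https://example.com/history/KDCA/$year/$month/$day?format=1')

def daterange(start, end):
    """Yield each date from start up to and including end."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

jobs = [
    (demo_template.substitute(year=d.year, month=d.month, day=d.day),
     'weather-%04d%02d%02d.csv' % (d.year, d.month, d.day))
    for d in daterange(date(2017, 1, 1), date(2017, 3, 31))
]
print(len(jobs), jobs[0], jobs[-1])
```

Building the list first makes it easy to confirm that the quarter has the number of days you expect before making any requests.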

In [None]:
!head weather-20170125.csv | csvlook

Something is not right! Let's look at the raw content of the CSV file.

In [None]:
!head weather-20170125.csv

There are two issues:
1. The first line is blank
2. Each line ends with an extra `<br />` tag.

Let's fix them:

In [None]:
!sed 's/<br \/>//g;/^$/d' weather-20170125.csv | head | csvlook

Now it looks much better! Apply the fix to all weather CSV files.

In [None]:
!for f in weather-2017*.csv; do sed -i 's/<br \/>//g;/^$/d' ${f}; done
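If `sed` isn't available, or you'd rather stay in Python, the same two fixes can be sketched as a small function. `clean_weather_csv` is a hypothetical helper, and the sample text is made up to show both problems.

```python
def clean_weather_csv(text):
    """Remove '<br />' tags and blank lines, like the sed command above."""
    lines = text.replace('<br />', '').splitlines()
    return '\n'.join(line for line in lines if line != '') + '\n'

# Made-up sample showing the two problems: a blank first line and trailing tags.
raw = '\nTimeEST,TemperatureF\n12:52 AM,42.1<br />\n1:52 AM,41.0<br />\n'
cleaned = clean_weather_csv(raw)
print(cleaned)
```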

In [None]:
!csvstack weather-201701*.csv weather-201702*.csv weather-201703*.csv > weather-2017q1.csv

In [None]:
!csvstat weather-2017q1.csv

We've noticed special values such as `N/A`, `-` and `None`. We need to remove them so that they will be treated as NULL by the database.

In [None]:
!sed -i 's/,N\/A,/,,/g;s/,-,/,,/g;s/,None,/,,/g' weather-2017q1.csv
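One thing to watch with the pattern-based approach: each pattern needs a comma on both sides, so a marker in the first or last column is not matched, and two markers side by side may not both be replaced. A parse-based alternative compares whole fields instead. This is a sketch with made-up rows; `blank_null_markers` is a hypothetical helper.

```python
import csv
import io

NULL_MARKERS = {'N/A', '-', 'None'}

def blank_null_markers(text):
    """Replace fields that are exactly a null marker with an empty string."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for row in reader:
        writer.writerow(['' if field in NULL_MARKERS else field for field in row])
    return out.getvalue()

# Made-up rows; note the adjacent markers and the marker in the last column.
sample = 'time,gust,precip,events\n12:52 AM,-,N/A,Rain\n1:52 AM,18.4,0.01,None\n'
result_text = blank_null_markers(sample)
print(result_text)
```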

Based on these values, I expect we can work with the following schema for weather:

In [None]:
%%sql
DROP TABLE IF EXISTS weather;
CREATE TABLE weather (
    id SERIAL PRIMARY KEY,
    time_str VARCHAR(8),
    temp NUMERIC,
    dew_point NUMERIC,
    humidity NUMERIC,
    pressure NUMERIC,
    visibility NUMERIC,
    wind_dir VARCHAR(8),
    wind_speed VARCHAR(10),
    gust_speed NUMERIC,
    precipitation NUMERIC,
    events VARCHAR(50),
    conditions VARCHAR(50),
    wind_dir_degrees NUMERIC,
    time_utc TIMESTAMPTZ,
    time TIMESTAMP
)

Now we'll load the data into PostgreSQL. Note that this requires the use of an absolute path, so adjust it to your location:

In [None]:
!pwd

In [None]:
%%sql
COPY weather 
(time_str, temp, dew_point, humidity, pressure, visibility, wind_dir, wind_speed, gust_speed, precipitation, events, conditions, wind_dir_degrees, time_utc)
FROM '/home/jovyan/work/gwu/2017/lectures/star-week-01/weather-2017q1.csv'
CSV
HEADER
QUOTE '"'
DELIMITER ',';

In [None]:
%%sql
SELECT * from weather LIMIT 10;

Next, we need to convert UTC time to EST or EDT. We know Daylight Saving Time started on Sunday, March 12, 2017, 2:00:00 am. The conversion takes two steps:

First we convert UTC times to EST and populate the `time` attribute for all `time_utc` values before `2017-03-12 07:00:00+00:00`, which is Sunday, March 12, 2017, 2:00:00 am EST.

In [None]:
%%sql
UPDATE weather SET time = time_utc AT TIME ZONE 'EST'
WHERE time_utc < '2017-03-12 07:00:00+00:00';

Next we convert UTC times to EDT and populate the `time` attribute for all `time_utc` values at or after `2017-03-12 07:00:00+00:00`, the moment local clocks jump from 2:00:00 am EST to 3:00:00 am EDT.

In [None]:
%%sql
UPDATE weather SET time = time_utc AT TIME ZONE 'EDT'
WHERE time_utc >= '2017-03-12 07:00:00+00:00';

Verify that time attributes look okay on March 12:

In [None]:
%%sql
SELECT time_str, time from weather 
WHERE TO_CHAR(time, 'YYYY-MM-DD') = '2017-03-12'
ORDER BY time;
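The two-step conversion can be cross-checked in plain Python with fixed offsets. This sketch hard-codes the 2017 switchover and treats the switchover instant itself as EDT, so it is only meant for this quarter's data. `to_local` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5), 'EST')
EDT = timezone(timedelta(hours=-4), 'EDT')
# 2:00 am EST on March 12, 2017, expressed in UTC.
DST_START_UTC = datetime(2017, 3, 12, 7, 0, tzinfo=timezone.utc)

def to_local(time_utc):
    """Convert an aware UTC datetime to naive local wall-clock time (2017 Q1 only)."""
    zone = EST if time_utc < DST_START_UTC else EDT
    return time_utc.astimezone(zone).replace(tzinfo=None)

before = to_local(datetime(2017, 3, 12, 6, 52, tzinfo=timezone.utc))
after = to_local(datetime(2017, 3, 12, 7, 52, tzinfo=timezone.utc))
print(before, after)
```

The hour between 2:00 am and 3:00 am never appears in local time on that day, which is what the query above should show as well.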

Now we add two foreign key columns to the `rides` table that reference the `weather` dimension table.

In [None]:
%%sql
ALTER TABLE rides 
ADD COLUMN start_weather_id INTEGER,
ADD CONSTRAINT fk_start_weather
FOREIGN KEY (start_weather_id)
REFERENCES weather (id);

In [None]:
%%sql
ALTER TABLE rides 
ADD COLUMN end_weather_id INTEGER,
ADD CONSTRAINT fk_end_weather
FOREIGN KEY (end_weather_id)
REFERENCES weather (id);

In [None]:
%%sql
UPDATE rides AS r
SET start_weather_id = w.id
FROM weather AS w
WHERE TO_CHAR(r.start_date, 'YYYY-MM-DD HH24') = TO_CHAR(w.time, 'YYYY-MM-DD HH24');

In [None]:
%%sql
UPDATE rides AS r
SET end_weather_id = w.id
FROM weather AS w
WHERE TO_CHAR(r.end_date, 'YYYY-MM-DD HH24') = TO_CHAR(w.time, 'YYYY-MM-DD HH24');
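These updates match rides to weather by truncating both timestamps to the hour. If several weather observations fall in the same hour, `UPDATE ... FROM` uses just one of them, and which one is not something to rely on. The sketch below shows the same hour-bucket idea in Python with made-up observations, keeping the first one seen per hour. `hour_key` is a hypothetical helper.

```python
from datetime import datetime

def hour_key(ts):
    """Truncate a timestamp to the hour, like TO_CHAR(ts, 'YYYY-MM-DD HH24')."""
    return ts.strftime('%Y-%m-%d %H')

# Hypothetical weather observations as (id, time); two fall in the same hour.
weather = [
    (1, datetime(2017, 1, 18, 8, 52)),
    (2, datetime(2017, 1, 18, 9, 20)),
    (3, datetime(2017, 1, 18, 9, 52)),
]

# Keep the first observation seen for each hour.
weather_by_hour = {}
for weather_id, ts in weather:
    weather_by_hour.setdefault(hour_key(ts), weather_id)

ride_start = datetime(2017, 1, 18, 9, 5)
print(weather_by_hour.get(hour_key(ride_start)))                   # hour with weather data
print(weather_by_hour.get(hour_key(datetime(2017, 4, 1, 0, 10))))  # no data for April
```

A ride whose hour has no weather observation gets `None` here, which corresponds to a `NULL` foreign key in the `rides` table.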

Pay attention to the number of rows affected. The `end_weather_id` column is not updated for some rides because their end times fall after 2017-03-31, and we don't have weather data for them.

In [None]:
%%sql
SELECT COUNT(*) FROM rides
WHERE start_weather_id IS NULL;

In [None]:
%%sql
SELECT COUNT(*) FROM rides
WHERE end_weather_id IS NULL;

Let's find out under what weather conditions people ride bikeshare.

In [None]:
%%sql
SELECT w.conditions, COUNT(*) count
FROM rides
JOIN weather AS w
ON w.id = rides.start_weather_id
GROUP BY w.conditions
ORDER BY count DESC;

In [None]:
%matplotlib inline

In [None]:
result = _
result.bar()