# Snapshot Fact Table

This week we'll build up a very simple snapshot fact table using the weather data.

## Setup - install PostgreSQL (Optional)

We are going to use [PostgreSQL](https://www.postgresql.org) this time. If you are using datanotebook.org or AWS EC2 instances based on our AMI, you can skip this section. If PostgreSQL is not installed, follow the [instructions](https://www.postgresql.org/download/linux/) to install it.

If the PostgreSQL server is not running, execute the following cell to start it:

In [None]:
!sudo /etc/init.d/postgresql start

To connect to PostgreSQL, we need to make sure the [ipython-sql](https://github.com/catherinedevlin/ipython-sql) and [psycopg2](https://github.com/psycopg/psycopg2) libraries are installed.

In [None]:
!pip install ipython-sql psycopg2

## Weather snapshots

We'll download the same weather data you've worked with before, and we'll create some database tables using PostgreSQL.

In [None]:
%load_ext sql

In [None]:
!createdb week10

In [None]:
%sql postgresql:///week10

Weather Underground offers access to weather history data at links like https://www.wunderground.com/history/airport/KDCA/2017/1/18/DailyHistory.html?req_city=Washington&req_state=DC&req_statename=District+of+Columbia&reqdb.zip=20003&reqdb.magic=1&reqdb.wmo=99999.  

You can also download data in CSV format. For example: https://www.wunderground.com/history/airport/KDCA/2017/1/18/DailyHistory.html?req_city=Washington&req_state=DC&req_statename=District+of+Columbia&reqdb.zip=20003&reqdb.magic=1&reqdb.wmo=99999&format=1

Let's write some simple Python code to download the weather data for the first quarter of 2017.

In [None]:
from string import Template
import calendar

import requests

year = 2017
url_template = Template('https://www.wunderground.com/history/airport/KDCA/$year/$month/$day/DailyHistory.html?req_city=Washington&req_state=DC&req_statename=District+of+Columbia&reqdb.zip=20003&reqdb.magic=1&reqdb.wmo=99999&format=1')
for month in range(1, 4):
    # monthrange() returns (weekday of the 1st, number of days in the month)
    days = calendar.monthrange(year, month)[1]
    for day in range(1, days + 1):
        r = requests.get(url_template.substitute(year=year, month=month, day=day))
        filename = 'weather-%04d%02d%02d.csv' % (year, month, day)
        print('Saving %s' % filename)
        # Use a context manager so each file is closed as soon as it is written
        with open(filename, 'wb') as f:
            f.write(r.content)
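
The download loop relies on `calendar.monthrange` to find the number of days in each month. As a minimal sketch that needs no network access, the following lists the file names the loop is expected to write; the helper name `q1_filenames` is hypothetical and not part of the original notebook.

```python
import calendar

def q1_filenames(year):
    """List the CSV file names the download loop writes for January-March."""
    names = []
    for month in range(1, 4):
        # monthrange() returns (weekday of the 1st, number of days in the month)
        days = calendar.monthrange(year, month)[1]
        for day in range(1, days + 1):
            names.append('weather-%04d%02d%02d.csv' % (year, month, day))
    return names

names = q1_filenames(2017)
print(len(names), names[0], names[-1])
```

Comparing this list against the files actually on disk is one way to spot a download that silently failed.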

In [None]:
!head weather-20170125.csv | csvlook

Something is not right! Let's look at the raw content of the CSV file.

In [None]:
!head weather-20170125.csv

There are two issues:

1. The first line is blank.
2. Each line ends with extra characters (a `<br />` tag).

Let's fix them:

In [None]:
!sed 's/<br \/>//g;/^$/d' weather-20170125.csv | head | csvlook

Now it looks much better! Apply the fix to all weather CSV files.

In [None]:
!for f in weather-2017*.csv; do sed -i 's/<br \/>//g;/^$/d' ${f}; done
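
For readers less familiar with `sed`, the same cleanup can be sketched in Python. This is only an illustration of what the two `sed` commands do (strip `<br />`, then drop empty lines); the function name `clean_csv_text` and the sample text are made up for the example.

```python
def clean_csv_text(text):
    """Drop '<br />' markers and empty lines, mirroring the sed command."""
    cleaned = []
    for line in text.splitlines():
        line = line.replace('<br />', '')  # same as s/<br \/>//g
        if line:  # same effect as /^$/d
            cleaned.append(line)
    return '\n'.join(cleaned) + '\n'

# Made-up sample: a blank first line and '<br />' at the end of each row
sample = '\ntime,temp<br />\n12:52 AM,42.1<br />\n'
print(clean_csv_text(sample))
```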

In [None]:
!csvstack weather-201701*.csv weather-201702*.csv weather-201703*.csv > weather-2017q1.csv

In [None]:
!csvstat weather-2017q1.csv

We've noticed special values such as `N/A`, `-` and `None`. We need to remove them so that they will be treated as NULL by the database.

In [None]:
!sed -i 's/,N\/A,/,,/g;s/,-,/,,/g;s/,None,/,,/g' weather-2017q1.csv
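
The `sed` patterns only match a token that has a comma on both sides. Because matches cannot overlap, the same token in two adjacent fields, or a token in the last column, could be left in place. A field-based alternative can be sketched with Python's `csv` module. This is an illustration rather than the notebook's method, and `blank_null_tokens` and the sample rows are hypothetical.

```python
import csv
import io

NULL_TOKENS = {'N/A', '-', 'None'}

def blank_null_tokens(csv_text):
    """Replace fields that are exactly a null token with an empty string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for row in csv.reader(io.StringIO(csv_text)):
        writer.writerow(['' if field in NULL_TOKENS else field for field in row])
    return out.getvalue()

# Made-up rows: adjacent N/A fields, a token in the last column,
# and a negative number that must be left alone
sample = 'a,N/A,N/A,-\nb,-5,None,x\n'
print(blank_null_tokens(sample))
```

Only whole fields are compared, so a value such as `-5` is preserved.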

Based on these values, I expect we can work with the following schema for weather:

In [None]:
%%sql
DROP TABLE IF EXISTS weather;
CREATE TABLE weather (
    id SERIAL PRIMARY KEY,
    time_str VARCHAR(8),
    temp NUMERIC,
    dew_point NUMERIC,
    humidity NUMERIC,
    pressure NUMERIC,
    visibility NUMERIC,
    wind_dir VARCHAR(8),
    wind_speed VARCHAR(10),
    gust_speed NUMERIC,
    precipitation NUMERIC,
    events VARCHAR(50),
    conditions VARCHAR(50),
    wind_dir_degrees NUMERIC,
    time_utc TIMESTAMPTZ,
    time TIMESTAMP
)

Now we'll load the data into PostgreSQL. Note that this requires the use of an absolute path, so adjust it to your location:

In [None]:
!pwd

In [None]:
%%sql
COPY weather 
(time_str, temp, dew_point, humidity, pressure, visibility, wind_dir, wind_speed, gust_speed, precipitation, events, conditions, wind_dir_degrees, time_utc)
FROM '/home/jovyan/work/gwu/2017/lectures/star-week-03/weather-2017q1.csv'
CSV
HEADER
QUOTE '"'
DELIMITER ',';

In [None]:
%%sql
SELECT * FROM weather LIMIT 10;

Next, we need to convert UTC time to EST or EDT. Daylight Saving Time started on Sunday, March 12, 2017, at 2:00 am EST, which is `2017-03-12 07:00:00+00:00` in UTC. The conversion takes two steps.

First, we convert UTC times to EST and populate the `time` attribute for all `time_utc` values before `2017-03-12 07:00:00+00:00`.

In [None]:
%%sql
UPDATE weather SET time = time_utc AT TIME ZONE 'EST'
WHERE time_utc < '2017-03-12 07:00:00+00:00';

Next, we convert UTC times to EDT and populate the `time` attribute for all `time_utc` values at or after `2017-03-12 07:00:00+00:00`, the instant at which local clocks jump from 2:00 am EST to 3:00 am EDT.

In [None]:
%%sql
UPDATE weather SET time = time_utc AT TIME ZONE 'EDT'
WHERE time_utc >= '2017-03-12 07:00:00+00:00';

Verify that time attributes look okay on March 12:

In [None]:
%%sql
SELECT time_str, time FROM weather
WHERE TO_CHAR(time, 'YYYY-MM-DD') = '2017-03-12'
ORDER BY time;
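
As an independent cross-check on the conversion around the Daylight Saving Time switch, the same UTC instants can be converted in Python with the standard-library `zoneinfo` module (Python 3.9+, and it assumes the system time zone database is available). The helper name `utc_to_eastern` is hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

EASTERN = ZoneInfo('America/New_York')

def utc_to_eastern(ts_utc):
    """Convert an aware UTC datetime to a naive US Eastern wall-clock time."""
    return ts_utc.astimezone(EASTERN).replace(tzinfo=None)

before = datetime(2017, 3, 12, 6, 59, tzinfo=timezone.utc)
at_switch = datetime(2017, 3, 12, 7, 0, tzinfo=timezone.utc)
print(utc_to_eastern(before))     # 2017-03-12 01:59:00
print(utc_to_eastern(at_switch))  # 2017-03-12 03:00:00
```

No local time between 2:00 am and 3:00 am should appear for March 12. PostgreSQL's `AT TIME ZONE` also accepts region names such as `'America/New_York'`, which apply the EST/EDT rule automatically.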

### Create dimension table
We are going to create a dimension table called `hours` to store the hours in the first quarter of 2017. We'll first preview the attributes we plan to derive from each timestamp:

In [None]:
%%sql
SELECT TO_CHAR(time, 'YYYY-MM-DD HH24') AS hour,
    TO_CHAR(time, 'YYYY-MM-DD') AS day, 
    CAST(TO_CHAR(time, 'YYYY') AS INTEGER) AS year,
    CAST(TO_CHAR(time, 'MM') AS INTEGER) AS month,
    CAST(TO_CHAR(time, 'DD') AS INTEGER) AS day_of_month,
    TO_CHAR(time, 'Day') AS day_of_week_str,
    CAST(TO_CHAR(time, 'D') AS INTEGER) AS day_of_week,
    CASE WHEN CAST(TO_CHAR(time, 'D') AS INTEGER) IN (1, 7) 
        THEN TRUE
        ELSE FALSE
    END AS is_weekend,
    CASE WHEN CAST(TO_CHAR(time, 'D') AS INTEGER) NOT IN (1, 7) 
        THEN TRUE
        ELSE FALSE
    END AS is_weekday,
    CAST(TO_CHAR(time, 'HH24') AS INTEGER) AS hour_of_day,
    CAST(TO_CHAR(time, 'Q') AS INTEGER) AS quarter
FROM weather
LIMIT 20;

In [None]:
%%sql
DROP TABLE IF EXISTS hours;
CREATE TABLE hours (
    id SERIAL PRIMARY KEY,
    hour CHAR(13),
    day CHAR(10),
    year INTEGER,
    month INTEGER,
    day_of_month INTEGER,
    day_of_week_str CHAR(9),
    day_of_week INTEGER,
    is_weekend BOOLEAN,
    is_weekday BOOLEAN,
    hour_of_day INTEGER,
    quarter INTEGER
);

In [None]:
%%sql
INSERT INTO hours (hour, day, year, month, day_of_month, day_of_week_str, day_of_week,
                  is_weekend, is_weekday, hour_of_day, quarter)
SELECT DISTINCT TO_CHAR(time, 'YYYY-MM-DD HH24') AS hour,
    TO_CHAR(time, 'YYYY-MM-DD') AS day, 
    CAST(TO_CHAR(time, 'YYYY') AS INTEGER) AS year,
    CAST(TO_CHAR(time, 'MM') AS INTEGER) AS month,
    CAST(TO_CHAR(time, 'DD') AS INTEGER) AS day_of_month,
    TO_CHAR(time, 'Day') AS day_of_week_str,
    CAST(TO_CHAR(time, 'D') AS INTEGER) AS day_of_week,
    CASE WHEN CAST(TO_CHAR(time, 'D') AS INTEGER) IN (1, 7) 
        THEN TRUE
        ELSE FALSE
    END AS is_weekend,
    CASE WHEN CAST(TO_CHAR(time, 'D') AS INTEGER) NOT IN (1, 7) 
        THEN TRUE
        ELSE FALSE
    END AS is_weekday,
    CAST(TO_CHAR(time, 'HH24') AS INTEGER) AS hour_of_day,
    CAST(TO_CHAR(time, 'Q') AS INTEGER) AS quarter
FROM weather;
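
To make the derived attributes easier to follow, here is a rough Python equivalent of the `TO_CHAR` expressions for a single timestamp. It is a sketch, not part of the load process: `hour_attributes` is a hypothetical helper, and `day_of_week_str` is omitted because its text depends on locale and padding.

```python
from datetime import datetime

def hour_attributes(ts):
    """Derive the hours-dimension attributes from a naive local datetime."""
    # TO_CHAR(..., 'D') numbers days Sunday=1 .. Saturday=7,
    # while isoweekday() numbers them Monday=1 .. Sunday=7.
    day_of_week = ts.isoweekday() % 7 + 1
    return {
        'hour': ts.strftime('%Y-%m-%d %H'),
        'day': ts.strftime('%Y-%m-%d'),
        'year': ts.year,
        'month': ts.month,
        'day_of_month': ts.day,
        'day_of_week': day_of_week,
        'is_weekend': day_of_week in (1, 7),
        'is_weekday': day_of_week not in (1, 7),
        'hour_of_day': ts.hour,
        'quarter': (ts.month - 1) // 3 + 1,
    }

# March 12, 2017 was a Sunday
row = hour_attributes(datetime(2017, 3, 12, 3, 52))
print(row['hour'], row['day_of_week'], row['is_weekend'], row['quarter'])
```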

In [None]:
%%sql
SELECT * FROM hours LIMIT 20;

### Create snapshot table

In [None]:
%%sql
DROP TABLE IF EXISTS weather_fact;
CREATE TABLE weather_fact (
    id INTEGER,
    temp NUMERIC,
    dew_point NUMERIC,
    humidity NUMERIC,
    pressure NUMERIC,
    visibility NUMERIC,
    wind_dir VARCHAR(8),
    wind_speed VARCHAR(10),
    gust_speed NUMERIC,
    precipitation NUMERIC,
    events VARCHAR(50),
    conditions VARCHAR(50),
    wind_dir_degrees NUMERIC,
    hour_id INTEGER REFERENCES hours (id)
);

Populate the snapshot fact table with data from the `weather` and `hours` tables.

In [None]:
%%sql
INSERT INTO weather_fact
SELECT w.id, w.temp, w.dew_point, w.humidity, w.pressure, w.visibility, w.wind_dir, 
       w.wind_speed, w.gust_speed, w.precipitation, w.events, w.conditions, 
       w.wind_dir_degrees, h.id
FROM weather AS w, hours AS h
WHERE h.hour = TO_CHAR(w.time, 'YYYY-MM-DD HH24');

Make sure the weather is sampled only once per hour. Let's query the `weather_fact` table to see whether any hour has multiple weather readings.

In [None]:
%%sql
SELECT hour_id, COUNT(*) FROM weather_fact 
GROUP BY hour_id
HAVING COUNT(*) > 1
LIMIT 10;

Apparently, some hours have more than one reading. For example, hour `541` has 7 readings:

In [None]:
%%sql
SELECT * FROM weather_fact
WHERE hour_id = 541;

Only keep the first weather reading for each hour by removing all subsequent readings within the same hour:

In [None]:
%%sql
DELETE FROM weather_fact t1
  USING weather_fact t2
  WHERE t2.hour_id = t1.hour_id
  AND t1.id > t2.id;
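
The self-join `DELETE` removes every row for which another row with the same `hour_id` and a smaller `id` exists, so only the smallest `id` per hour survives. The same rule can be sketched in Python on made-up rows. `keep_first_per_hour` is a hypothetical helper, and "first" means earliest in time only on the assumption that `id` values follow the chronological order of the loaded file.

```python
def keep_first_per_hour(rows):
    """Keep, for each hour_id, only the row with the smallest id."""
    first = {}
    for row in rows:
        current = first.get(row['hour_id'])
        if current is None or row['id'] < current['id']:
            first[row['hour_id']] = row
    return sorted(first.values(), key=lambda r: r['id'])

# Made-up rows: hour 541 has two readings, hour 542 has one
rows = [
    {'id': 10, 'hour_id': 541},
    {'id': 11, 'hour_id': 541},
    {'id': 12, 'hour_id': 542},
]
print(keep_first_per_hour(rows))
```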

In [None]:
%%sql
SELECT hour_id, COUNT(*) FROM weather_fact 
GROUP BY hour_id
HAVING COUNT(*) > 1
LIMIT 10;

In [None]:
%%sql
SELECT COUNT(*) FROM hours;

In [None]:
%%sql
SELECT COUNT(*) FROM weather_fact;

How many hours are there in the first quarter of 2017?

In [None]:
24 * (31 + 28 + 31)
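
The product above counts 24 wall-clock hours for every day and ignores the hour skipped when Daylight Saving Time starts. As a sketch under the assumption that the local zone is `America/New_York`, the following enumerates the local hour labels that actually occur in the quarter; `local_hour_labels` is a hypothetical helper and requires Python 3.9+ for `zoneinfo`.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

EASTERN = ZoneInfo('America/New_York')

def local_hour_labels(start_local, end_local):
    """Return 'YYYY-MM-DD HH' labels for each local hour in [start, end)."""
    # Step through UTC instants so the DST jump is handled correctly
    t = start_local.replace(tzinfo=EASTERN).astimezone(timezone.utc)
    end = end_local.replace(tzinfo=EASTERN).astimezone(timezone.utc)
    labels = []
    while t < end:
        labels.append(t.astimezone(EASTERN).strftime('%Y-%m-%d %H'))
        t += timedelta(hours=1)
    return labels

labels = local_hour_labels(datetime(2017, 1, 1), datetime(2017, 4, 1))
print(len(labels))                # 2159
print('2017-03-12 02' in labels)  # False
```

Comparing these labels with the `hour` column of the `hours` table would show exactly which hours have no reading.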

Three hours are missing from the `hours` dimension table:

In [None]:
%%sql
SELECT day, COUNT(*)
FROM hours
GROUP BY day
HAVING COUNT(*) < 24;

We know we lost one hour on 2017-03-12 as we "spring forward". Let's look at what happened on 2017-03-14:

In [None]:
%%sql
SELECT id, time
FROM weather
WHERE TO_CHAR(time, 'YYYY-MM-DD') = '2017-03-14';

There is no reading for hour 16 on that day!

We don't need the `id` attribute in the fact table, so let's drop it:

In [None]:
%%sql
ALTER TABLE weather_fact DROP COLUMN id;

Now it is time to explore the data:

In [None]:
%%sql
SELECT h.hour_of_day, AVG(temp)
FROM weather_fact AS w
JOIN hours AS h
ON h.id = w.hour_id
WHERE h.month = 3
GROUP BY h.hour_of_day
ORDER BY hour_of_day;

In [None]:
%matplotlib inline

In [None]:
result = _
result.bar()