# Data Engineering Capstone Project

## Import all necessary libraries and packages

In [1]:
import findspark
findspark.init()  # run first so pyspark is importable when the etl module loads it

from python_modules.etl import *

## Set environment variables (AWS credentials) so Spark can access the S3 buckets

In [2]:
set_aws_credentials('./cfg/aws_credentials.cfg')
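
`set_aws_credentials` is imported from `python_modules.etl`, and its body is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming the config file is INI-style with an `[AWS]` section holding `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (the section name, key names, and the function name `set_aws_credentials_sketch` are all assumptions):

```python
import configparser
import os


def set_aws_credentials_sketch(cfg_path):
    """Read AWS keys from an INI-style file and export them as environment variables.

    Hypothetical stand-in for the notebook's helper; the [AWS] section and
    the key names are assumptions about the config layout.
    """
    config = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it parsed successfully
    if not config.read(cfg_path):
        raise FileNotFoundError(f"Could not read config file: {cfg_path}")

    os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```

Exporting the keys as environment variables keeps them out of the notebook itself, and the S3A connector can usually pick them up through its default credential provider chain.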

## Initialize a SparkSession with the hadoop-aws package so Spark can read from and write to S3 buckets

In [3]:
spark = create_spark_session()
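
The body of `create_spark_session` is likewise not shown here. As a rough sketch, a session with S3 support can be built by passing the `hadoop-aws` Maven coordinates through `spark.jars.packages`. The function names, the app name, and the version `3.3.4` below are assumptions; the version should match the Hadoop build bundled with the local Spark installation.

```python
def spark_session_options(hadoop_aws_version="3.3.4"):
    """Return builder options for an S3-capable session.

    The default version string is an assumption, not the project's setting.
    """
    return {
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
    }


def create_spark_session_sketch(app_name="capstone"):
    """Hypothetical stand-in for the notebook's create_spark_session helper."""
    # Imported lazily so the options helper can be used without Spark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in spark_session_options().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```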

## Set the input path to the S3 bucket that holds the data, and the output path to where the Parquet files will be stored

In [4]:
input_path = "s3a://tung99-bucket/"
output_path = "s3a://tung99-bucket/"

## Process each CSV and JSON file into a DataFrame

After a brief inspection of the data files, the following issues were observed:
- Some entries in the ```nta_name``` column of the taxi zone lookup table contain more than one NTA name, which makes it impossible to match them to the correct NTA code. To work around this, each such row is split on the ```/``` delimiter into rows that are identical except for the NTA name.
- Similarly, some NTA names in the ```nta_codes.json``` file are joined by ```-```, so they are also split into separate names. These fixes cannot guarantee that every location ID maps to an NTA code, but they significantly increase the number of matches (from 108 to 238).
- There is an empty row at the beginning of the trips file, but it is dropped by the later ```JOIN``` operations.
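
The row-splitting fix described above can be illustrated in plain Python. This is only a sketch of the logic, not the project's implementation: the function name `split_names` is hypothetical, and the combined name `Allerton/Pelham Gardens` is inferred from the two rows that share location ID 3 in the zone table output. In PySpark the same effect is typically achieved by combining `pyspark.sql.functions.split` with `explode`.

```python
def split_names(rows, column, delimiter):
    """Expand each row into one row per delimited name in `column`."""
    expanded = []
    for row in rows:
        for name in row[column].split(delimiter):
            new_row = dict(row)  # copy so all other fields are duplicated
            new_row[column] = name.strip()
            expanded.append(new_row)
    return expanded


# Illustrative input; the combined name is inferred, not read from the file
zones = [
    {"location_id": 3, "nta_name": "Allerton/Pelham Gardens"},
    {"location_id": 4, "nta_name": "Alphabet City"},
]

for row in split_names(zones, "nta_name", "/"):
    print(row)
```

Each split row keeps the original location ID, so a single location can match more than one NTA name.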

In [5]:
zone_df = create_zone_df(spark, input_path)
temp_df = create_temp_df(spark, input_path)
trips_df = create_trips_df(spark, input_path)
nta_df = create_nta_df(spark, input_path)

### Taxi zone lookup file (this connects the location ID to an NTA name)

In [6]:
zone_df.show(5)

+-----------+---------+--------------+------------+
|location_id|boro_name|      nta_name|service_zone|
+-----------+---------+--------------+------------+
|          1|      EWR|Newark Airport|         EWR|
|          2|   Queens|   Jamaica Bay|   Boro Zone|
|          3|    Bronx|      Allerton|   Boro Zone|
|          3|    Bronx|Pelham Gardens|   Boro Zone|
|          4|Manhattan| Alphabet City| Yellow Zone|
+-----------+---------+--------------+------------+
only showing top 5 rows



### Temperature information file

In [7]:
temp_df.show(5)

+---------+-----------+-------------------+----+-----------+------------+----+------------+---------+--------+-----+---+
|sensor_id|   air_temp|               date|hour|   latitude|   longitude|year|install_type|boro_name|nta_code|month|day|
+---------+-----------+-------------------+----+-----------+------------+----+------------+---------+--------+-----+---+
| Bk-BR_01|     71.189|2018-06-15 00:00:00|   1|40.66620508|-73.91691035|2018| Street Tree| Brooklyn|    BK81|    6| 15|
| Bk-BR_01|70.24333333|2018-06-15 00:00:00|   2|40.66620508|-73.91691035|2018| Street Tree| Brooklyn|    BK81|    6| 15|
| Bk-BR_01|69.39266667|2018-06-15 00:00:00|   3|40.66620508|-73.91691035|2018| Street Tree| Brooklyn|    BK81|    6| 15|
| Bk-BR_01|68.26316667|2018-06-15 00:00:00|   4|40.66620508|-73.91691035|2018| Street Tree| Brooklyn|    BK81|    6| 15|
| Bk-BR_01|     67.114|2018-06-15 00:00:00|   5|40.66620508|-73.91691035|2018| Street Tree| Brooklyn|    BK81|    6| 15|
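+---------+-----------+-------------------+----+-----------+------------+----+------------+---------+--------+-----+---+
only showing top 5 rows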

### Taxi trips file. For simplicity, only trips from July 2018 are included, so several assumptions are made later on.

In [8]:
trips_df.show(5)

+---------+-------------------+-------------------+---------------+-------------+-----------+------------------+--------------+--------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+-----+------+------+-------+-------+
|vendor_id|            PU_date|            DO_date|passenger_count|trip_distance|ratecode_id|store_and_fwd_flag|PU_location_id|DO_location_id|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|month|PU_day|DO_day|PU_hour|DO_hour|
+---------+-------------------+-------------------+---------------+-------------+-----------+------------------+--------------+--------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+-----+------+------+-------+-------+
|     null|               null|               null|           null|         null|       null|              null|          null|          null|        null|       null| n

### NTA codes file (```nta_codes.json```). NTA names that contain ```-``` are split into separate rows that share the same NTA code.

In [9]:
nta_df.show(5)

+---------+--------+-------------+
|boro_name|nta_code|     nta_name|
+---------+--------+-------------+
| Brooklyn|    BK88| Borough Park|
|   Queens|    QN51|  Murray Hill|
|   Queens|    QN27|East Elmhurst|
| Brooklyn|    BK23|West Brighton|
|   Queens|    QN41|Fresh Meadows|
+---------+--------+-------------+
only showing top 5 rows



## Convert the information gathered from the above DataFrames into fact and dimension tables
Writing to Parquet normally happens in this step, but to save processing time the write calls are temporarily commented out in the original functions.

In [10]:
time_table = create_time_dim_table(spark, temp_df, output_path)
trips_table = create_trips_fact_table(spark, trips_df, time_table, output_path)
loc_table = create_loc_dim_table(spark, nta_df, zone_df, output_path)
temps_table = create_temps_fact_table(spark, temp_df, loc_table, output_path)

In [11]:
time_table.show(5)

+----+-----+---+-------+----+-------+
|year|month|day|weekday|hour|time_id|
+----+-----+---+-------+----+-------+
|2018|    7|  4|      4|   2|      0|
|2018|    7|  4|      4|  17|      1|
|2018|    7|  5|      5|  11|      2|
|2018|    7|  6|      6|  12|      3|
|2018|    7| 25|      4|   0|      4|
+----+-----+---+-------+----+-------+
only showing top 5 rows



In [12]:
trips_table.show(5)

+---------+------------+-------------+---------------+-------------+--------------+--------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+
|vendor_id|  PU_date_id|   DO_date_id|passenger_count|trip_distance|PU_location_id|DO_location_id|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|
+---------+------------+-------------+---------------+-------------+--------------+--------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+
|        2|283467841536|1125281431552|              3|         1.28|           246|           234|           1|        7.5|  0.5|    0.5|      1.76|         0.0|                  0.3|       10.56|
|        2| 25769803778|1125281431552|              1|         2.01|           164|            79|           2|        8.0|  0.5|    0.5|       0.0|         0.0|                  0.3|         9.3|
|        2| 257

In [13]:
loc_table.show(5)

+-----------+-------------+--------+-------------+------------+
|location_id|    boro_name|nta_code|     nta_name|service_zone|
+-----------+-------------+--------+-------------+------------+
|         26|     Brooklyn|    BK88| Borough Park|   Boro Zone|
|        170|    Manhattan|    QN51|  Murray Hill| Yellow Zone|
|         70|       Queens|    QN27|East Elmhurst|   Boro Zone|
|        245|Staten Island|    BK23|West Brighton|   Boro Zone|
|         98|       Queens|    QN41|Fresh Meadows|   Boro Zone|
+-----------+-------------+--------+-------------+------------+
only showing top 5 rows



In [14]:
temps_table.show(5)

+-------------+-----------+-----------+------------+
|      time_id|location_id|   air_temp|install_type|
+-------------+-----------+-----------+------------+
|1022202216448|         75|73.18433333| Street Tree|
|1022202216448|         75|72.95483333| Street Tree|
|1022202216448|         75|71.95583333| Street Tree|
|1022202216448|         75|73.24216667| Street Tree|
|1022202216448|         75|    72.7315| Street Tree|
+-------------+-----------+-----------+------------+
only showing top 5 rows



## Data quality checks, run after all tables have been built:
- Since the data only covers the month of July, the time table should contain exactly 744 entries: 31 days of 24 hours each.
- None of the tables should contain ```NULL```s. The only ```NULL``` row was the one at the beginning of the trips file, and it should have been eliminated by the ```JOIN``` on the ```trips``` table.
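
The expected count in the first check can be derived by enumerating every hour of the month. Below is a minimal sketch in plain Python. The function name `expected_hours` is hypothetical and not part of the project's `etl` module. The weekday numbering follows the convention visible in the `time_table` output (1 = Sunday), which matches Spark's `dayofweek`.

```python
from datetime import datetime, timedelta


def expected_hours(year, month):
    """List every (year, month, day, weekday, hour) tuple in a calendar month."""
    current = datetime(year, month, 1)
    rows = []
    while current.month == month:
        # Convert ISO weekday (1 = Monday) to 1 = Sunday ... 7 = Saturday
        weekday = current.isoweekday() % 7 + 1
        rows.append((current.year, current.month, current.day, weekday, current.hour))
        current += timedelta(hours=1)
    return rows


july = expected_hours(2018, 7)
print(len(july))  # 31 days * 24 hours = 744
```

A grid like this could also be compared against the time table to report which hours are missing when the count does not match.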

In [15]:
if not table_rows_check(time_table, 31*24):
    raise ValueError('Some hours are missing!')
else:
    print('All hours are present')

table_list = [trips_table, time_table, temps_table, loc_table]
if not null_values_check(table_list):
    raise ValueError('Null value(s) found!')
else:
    print('No null values found.')

All hours are present
No null values found.
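
The two check helpers come from `python_modules.etl` and are not shown in this notebook. A minimal sketch of how they might be written, assuming each argument is a Spark DataFrame (the `_sketch` function names are hypothetical). It relies only on `DataFrame.count`, `DataFrame.columns`, and `DataFrame.filter` with a SQL expression string:

```python
def table_rows_check_sketch(df, expected_rows):
    """Return True when the DataFrame has exactly `expected_rows` rows."""
    return df.count() == expected_rows


def null_values_check_sketch(tables):
    """Return True when no column of any DataFrame contains a NULL."""
    for df in tables:
        for column in df.columns:
            # DataFrame.filter accepts a SQL expression string;
            # backticks guard column names with unusual characters
            if df.filter(f"`{column}` IS NULL").count() > 0:
                return False
    return True
```

This version runs one count per column, which is easy to read but can be slow on wide tables. A single aggregation over all columns would be a reasonable alternative.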
