# Introduction

This notebook is the ETL for the yellow and green NYC taxi data for the years 2019-2020.

The data has been downloaded separately from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page and stored in `../data/raw/`. The file names are in the format `<colour>_tripdata_yyyy-mm.csv`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from dotenv import find_dotenv
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark import SparkContext

import pyspark.sql.functions as F
from pyspark.sql.functions import when
from pyspark.sql.types import (
    IntegerType,
    DateType,
    FloatType,
    StringType,
    TimestampType 
)

# from src.data.utils import count_missing

In [3]:
project_dir = Path().cwd().parent
data_dir = project_dir / 'data'
raw_data_dir = data_dir / 'raw'
interim_data_dir = data_dir / 'interim'
processed_data_dir = data_dir / 'processed'
reports_dir = project_dir / 'reports'

In [4]:
spark = (
    SparkSession
    .builder
    .master('local[12]')
    .appName('new_york_taxis')
    .getOrCreate()
)

In [5]:
spark

In [6]:
spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion()

'3.0.0'

In [7]:
spark.sparkContext._conf.getAll()

[('spark.driver.memory', '12g'),
 ('spark.driver.extraJavaOptions',
  '"-Dio.netty.tryReflectionSetAccessible=true"'),
 ('spark.executor.id', 'driver'),
 ('spark.driver.port', '36607'),
 ('spark.app.id', 'local-1619237817701'),
 ('spark.app.name', 'new_york_taxis'),
 ('spark.executor.extraJavaOptions',
  '"-Dio.netty.tryReflectionSetAccessible=true"'),
 ('spark.master', 'local[12]'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.host', '7501c46205ca'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.debug.maxToStringFields', '1000')]

In [8]:
conf = spark.sparkContext._conf.setAll([
    ('spark.driver.memory', '16g'),
    ('spark.executor.memory', '16g'),
    ('spark.app.name', 'new_york_taxis'),
])
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()

In [9]:
spark.sparkContext._conf.getAll()

[('spark.driver.extraJavaOptions',
  '"-Dio.netty.tryReflectionSetAccessible=true"'),
 ('spark.executor.id', 'driver'),
 ('spark.driver.memory', '16g'),
 ('spark.executor.memory', '16g'),
 ('spark.driver.port', '36607'),
 ('spark.app.id', 'local-1619237817701'),
 ('spark.app.name', 'new_york_taxis'),
 ('spark.executor.extraJavaOptions',
  '"-Dio.netty.tryReflectionSetAccessible=true"'),
 ('spark.master', 'local[12]'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.host', '7501c46205ca'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.debug.maxToStringFields', '1000')]

# Load data

The green and yellow taxi data are loaded together, which means that the extra feature, `Trip_type` in the green data set is omitted.

In [10]:
df_dict = {}

for colour in ['green', 'yellow']:
    path = raw_data_dir.joinpath(f'{colour}_tripdata_20*.csv').as_posix()
    df = spark.read.csv(path, header=True)
    
    # Add the taxi colour
    df = df.withColumn('colour', F.lit(colour))
    df_dict[colour] = df

# Check column differences

The green taxi data has more columns than does the yellow taxi data. See the table below:

|green columns|yellow columns|comment|
|-------------|--------------|-------|
|`lpep_pickup_datetime`|`pickup_datetime`|Rename to `pickup_datetime`|
|`lpep_dropoff_datetime`|`dropoff_datetime`|Rename to `dropoff_datetime`|
|`trip_type`||The green taxis have two types, "Street-hail" and "Dispatch". Drop this column.|
|`ehail_fee`||Drop this column.|

In [11]:
set(df_dict['green'].columns) - set(df_dict['yellow'].columns)

{'ehail_fee', 'lpep_dropoff_datetime', 'lpep_pickup_datetime', 'trip_type'}

In [12]:
set(df_dict['yellow'].columns) - set(df_dict['green'].columns)

{'tpep_dropoff_datetime', 'tpep_pickup_datetime'}

# Combine the green and the yellow

This requires the columns to be the same; renaming and dropping of columns is required, see [table](#Check-column-differences).

In [13]:
df_green = (
    df_dict['green']
    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime')
    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')
    .drop('trip_type')
    .drop('ehail_fee')
)

df_yellow = (
    df_dict['yellow']
    .withColumnRenamed('pickup_datetime', 'pickup_datetime')
    .withColumnRenamed('dropoff_datetime', 'dropoff_datetime')
)

df = df_green.union(df_yellow)

## Load/save parquet

In [14]:
path = interim_data_dir.joinpath('df_combined').as_posix()

In [15]:
# df.write.parquet(path, mode='overwrite')
df = spark.read.parquet(path)

In [16]:
df.count()

116825619

In [17]:
df.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- pickup_datetime: string (nullable = true)
 |-- dropoff_datetime: string (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- passenger_count: string (nullable = true)
 |-- trip_distance: string (nullable = true)
 |-- fare_amount: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- mta_tax: string (nullable = true)
 |-- tip_amount: string (nullable = true)
 |-- tolls_amount: string (nullable = true)
 |-- improvement_surcharge: string (nullable = true)
 |-- total_amount: string (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- congestion_surcharge: string (nullable = true)
 |-- colour: string (nullable = true)



# Change datatype

The columns are all of type string. This section changes some of them into more appropriate types.

In [18]:
datatype_dict = {
    'VendorID': StringType(),
    'pickup_datetime': TimestampType(),
    'dropoff_datetime': TimestampType(),
    'passenger_count': IntegerType(),
    'trip_distance': FloatType(),
    'RatecodeID': StringType(),
    'store_and_fwd_flag': StringType(),
    'PULocationID': StringType(),
    'DOLocationID': StringType(),
    'payment_type': StringType(),
    'fare_amount': FloatType(),
    'extra': FloatType(),
    'mta_tax': FloatType(),
    'tip_amount': FloatType(),
    'tolls_amount': FloatType(),
    'improvement_surcharge': FloatType(),
    'total_amount': FloatType(),
    'congestion_surcharge': FloatType()
}

In [19]:
# For convenience so all the code doesn't have to be manually typed
for key, value in datatype_dict.items():
    print(f".withColumn('{key}', F.col('{key}').astype({value}()))")

.withColumn('VendorID', F.col('VendorID').astype(StringType()))
.withColumn('pickup_datetime', F.col('pickup_datetime').astype(TimestampType()))
.withColumn('dropoff_datetime', F.col('dropoff_datetime').astype(TimestampType()))
.withColumn('passenger_count', F.col('passenger_count').astype(IntegerType()))
.withColumn('trip_distance', F.col('trip_distance').astype(FloatType()))
.withColumn('RatecodeID', F.col('RatecodeID').astype(StringType()))
.withColumn('store_and_fwd_flag', F.col('store_and_fwd_flag').astype(StringType()))
.withColumn('PULocationID', F.col('PULocationID').astype(StringType()))
.withColumn('DOLocationID', F.col('DOLocationID').astype(StringType()))
.withColumn('payment_type', F.col('payment_type').astype(StringType()))
.withColumn('fare_amount', F.col('fare_amount').astype(FloatType()))
.withColumn('extra', F.col('extra').astype(FloatType()))
.withColumn('mta_tax', F.col('mta_tax').astype(FloatType()))
.withColumn('tip_amount', F.col('tip_amount').astype(FloatType())

In [20]:
df_typed = (
    df
    .withColumn('VendorID', F.col('VendorID').astype(StringType()))
    .withColumn('pickup_datetime', F.col('pickup_datetime').astype(TimestampType()))
    .withColumn('dropoff_datetime', F.col('dropoff_datetime').astype(TimestampType()))
    .withColumn('passenger_count', F.col('passenger_count').astype(IntegerType()))
    .withColumn('trip_distance', F.col('trip_distance').astype(FloatType()))
    .withColumn('RatecodeID', F.col('RatecodeID').astype(StringType()))
    .withColumn('store_and_fwd_flag', F.col('store_and_fwd_flag').astype(StringType()))
    .withColumn('PULocationID', F.col('PULocationID').astype(StringType()))
    .withColumn('DOLocationID', F.col('DOLocationID').astype(StringType()))
    .withColumn('payment_type', F.col('payment_type').astype(StringType()))
    .withColumn('fare_amount', F.col('fare_amount').astype(FloatType()))
    .withColumn('extra', F.col('extra').astype(FloatType()))
    .withColumn('mta_tax', F.col('mta_tax').astype(FloatType()))
    .withColumn('tip_amount', F.col('tip_amount').astype(FloatType()))
    .withColumn('tolls_amount', F.col('tolls_amount').astype(FloatType()))
    .withColumn('improvement_surcharge', F.col('improvement_surcharge').astype(FloatType()))
    .withColumn('total_amount', F.col('total_amount').astype(FloatType()))
    .withColumn('congestion_surcharge', F.col('congestion_surcharge').astype(FloatType()))
)

In [21]:
df_typed.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: float (nullable = true)
 |-- fare_amount: float (nullable = true)
 |-- extra: float (nullable = true)
 |-- mta_tax: float (nullable = true)
 |-- tip_amount: float (nullable = true)
 |-- tolls_amount: float (nullable = true)
 |-- improvement_surcharge: float (nullable = true)
 |-- total_amount: float (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- congestion_surcharge: float (nullable = true)
 |-- colour: string (nullable = true)



# Extract dateparts

1. `pickup_year`
1. `pickup_month`
1. `pickup_dayofyear`
1. `pickup_dayofmonth`
1. `pickup_dayofweek`
1. `pickup_weekofyear`
1. `pickup_hour`

In [22]:
hour = F.udf(lambda x: x.hour, IntegerType())

In [23]:
df_dateparts = (
    df_typed
    .withColumn('pickup_year', F.year('pickup_datetime'))
    .withColumn('pickup_month', F.month('pickup_datetime'))
    .withColumn('pickup_dayofyear', F.dayofyear('pickup_datetime'))
    .withColumn('pickup_dayofmonth', F.dayofmonth('pickup_datetime'))
    .withColumn('pickup_dayofweek', F.dayofweek('pickup_datetime'))
    .withColumn('pickup_weekofyear', F.weekofyear('pickup_datetime'))
    .withColumn('pickup_hourofday', hour('pickup_datetime'))
)

In [24]:
#
df_dateparts.rdd.getNumPartitions()

21

In [25]:
#
df_dateparts.limit(10).show()

+--------+-------------------+-------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------------------+------------+------------+--------------------+------+-----------+------------+----------------+-----------------+----------------+-----------------+----------------+
|VendorID|    pickup_datetime|   dropoff_datetime|store_and_fwd_flag|RatecodeID|PULocationID|DOLocationID|passenger_count|trip_distance|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|payment_type|congestion_surcharge|colour|pickup_year|pickup_month|pickup_dayofyear|pickup_dayofmonth|pickup_dayofweek|pickup_weekofyear|pickup_hourofday|
+--------+-------------------+-------------------+------------------+----------+------------+------------+---------------+-------------+-----------+-----+-------+----------+------------+---------------------+------------+------------+----------------

In [26]:
path = interim_data_dir.joinpath('df_dateparts').as_posix()
n_partitions = 12 * 10

In [27]:
df_dateparts = spark.read.parquet(path)

# Count distinct

There are categorical columns with defined levels, these are:
* `VendorID`
* `RatecodeID`
* `store_and_fwd_flag`
* `PULocationID`
* `DOLocationID`
* `payment_type`

There are numerical columns with limited distinct values, these are:
* `Extra`: \\$0.50 or \\$1
* `MTA_tax`: \\$0.50
* `Improvement_surcharge`: \\$0.30


In [28]:
string_cols = [col for (col, col_type) in df_dateparts.dtypes if col_type == 'string']
string_cols

['VendorID',
 'store_and_fwd_flag',
 'RatecodeID',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'colour']

In [29]:
lim_num_cols = ['Extra', 'MTA_tax', 'Improvement_surcharge']

In [30]:
expression = [F.countDistinct(F.col(col)).alias(col) for col in string_cols + lim_num_cols]
df_dateparts.select(*expression).show()

+--------+------------------+----------+------------+------------+------------+------+-----+-------+---------------------+
|VendorID|store_and_fwd_flag|RatecodeID|PULocationID|DOLocationID|payment_type|colour|Extra|MTA_tax|Improvement_surcharge|
+--------+------------------+----------+------------+------------+------------+------+-----+-------+---------------------+
|       3|                12|     10307|         262|         263|       26639|     2|13215|    455|                 3767|
+--------+------------------+----------+------------+------------+------------+------+-----+-------+---------------------+



# Data validation

## String type columns

* [VendorID](#Validation:-VendorID)
* [PULocationID](#Validation:-PULocationID)
* [DOLocationID](#Validation:-DOLocationID)
* [RatecodeID](#Validation:-RatecodeID)
* [Store_and_fwd_flag](#Validation:-Store_and_fwd_flag)
* [Payment_type](#Validation:-Payment_type)

### Validation: `VendorID`

In [31]:
(
    df_dateparts
    .groupBy('VendorID')
    .count()
    .show()
)

+--------+--------+
|VendorID|   count|
+--------+--------+
|    null| 1998368|
|       1|39397979|
|       4|  267080|
|       2|75162192|
+--------+--------+



### Validation: `PULocationID`

In [32]:
(
    df_dateparts
    .groupBy('PULocationID')
    .count()
    .show()
)

+------------+------+
|PULocationID| count|
+------------+------+
|         125|   174|
|           7|295919|
|          51| 25600|
|         124|  3860|
|         205| 18742|
|         169| 16771|
|          15|  2535|
|          54|  2504|
|         232|  1740|
|         234|   621|
|         155| 14752|
|         132|  1263|
|         154|   517|
|         200|  6699|
|         101|  3004|
|          11|  5005|
|         138|  1201|
|          69| 33691|
|          29| 12508|
|          42|276845|
+------------+------+
only showing top 20 rows



### Validation: `DOLocationID`

In [33]:
(
    df_dateparts
    .groupBy('DOLocationID')
    .count()
    .sort(F.col('count').desc())
    .show()
)

+------------+---------+
|DOLocationID|    count|
+------------+---------+
|           N|107054891|
|        null|  1056169|
|           Y|   936458|
|          74|   297891|
|          42|   278393|
|          41|   243808|
|          75|   220532|
|         129|   185505|
|           7|   182808|
|         166|   160030|
|         181|   125980|
|          82|   124583|
|         236|   122205|
|          95|   118059|
|         238|   116732|
|         244|   115191|
|         223|   114151|
|          61|   112787|
|         116|   108780|
|          97|   102322|
+------------+---------+
only showing top 20 rows



### Validation: `Store_and_fwd_flag`

There are values that are neither `Y` nor `N`. These will be converted to nulls.

In [34]:
(
    df_dateparts
    .groupBy('Store_and_fwd_flag')
    .count()
    .sort(F.col('count').desc())
    .show()
)

+------------------+--------+
|Store_and_fwd_flag|   count|
+------------------+--------+
|                 1|76620041|
|                 2|16134920|
|                 N| 6816744|
|                 3| 4456577|
|                 5| 4149931|
|                 6| 2513689|
|                 4| 2099896|
|                 0| 2015177|
|              null| 1998368|
|                 Y|   19158|
|                 7|     507|
|                 8|     335|
|                 9|     276|
+------------------+--------+



In [35]:
(
    df_dateparts
    .groupBy('Store_and_fwd_flag')
    .count()
    .sort(F.col('count').desc())
    .toPandas()
#     .show()
)

Unnamed: 0,Store_and_fwd_flag,count
0,1,76620041
1,2,16134920
2,N,6816744
3,3,4456577
4,5,4149931
5,6,2513689
6,4,2099896
7,0,2015177
8,,1998368
9,Y,19158


In [36]:
valid_values = ['N', 'Y', None]
col = 'Store_and_fwd_flag'
(
    df_dateparts
    .withColumn(col, 
                F.when(~F.col('Store_and_fwd_flag').isin(valid_values), 'null')
                .otherwise(F.col('Store_and_fwd_flag')))
    .groupBy(col)
    .count()
    .sort(F.col('count').desc())
    .show()
)

+------------------+--------+
|Store_and_fwd_flag|   count|
+------------------+--------+
|                 1|76620041|
|                 2|16134920|
|                 N| 6816744|
|                 3| 4456577|
|                 5| 4149931|
|                 6| 2513689|
|                 4| 2099896|
|                 0| 2015177|
|              null| 1998368|
|                 Y|   19158|
|                 7|     507|
|                 8|     335|
|                 9|     276|
+------------------+--------+



In [37]:
valid_values = ['N', 'Y', 'null']
col = 'Store_and_fwd_flag'
(
    df_dateparts
    .withColumn(col, 
                F.when(~F.col('Store_and_fwd_flag').isin(valid_values), 'null')
                .otherwise(F.col('Store_and_fwd_flag')))
    .groupBy(col)
    .count()
    .sort(F.col('count').desc())
    .show()
)

+------------------+---------+
|Store_and_fwd_flag|    count|
+------------------+---------+
|              null|107991349|
|                 N|  6816744|
|              null|  1998368|
|                 Y|    19158|
+------------------+---------+



In [38]:
valid_values = ['N', 'Y', 'null']
col = 'Store_and_fwd_flag'
df_validated_store = (
    df_dateparts
    .withColumn(col, 
                F.when(~F.col(col).isin(valid_values), None)
                .otherwise(F.col(col)))
)

In [39]:
(
    df_validated_store
    .groupBy(col)
    .count()
    .sort(F.col('count').desc())
    .limit(5)
    .show()
)

+------------------+---------+
|Store_and_fwd_flag|    count|
+------------------+---------+
|              null|109989717|
|                 N|  6816744|
|                 Y|    19158|
+------------------+---------+



### Validation: `Payment_type`

In [40]:
col = 'Payment_type'
valid_values = [1, 2, 3, 4, 5, 6]

In [41]:
(
    df_validated_store
    .groupBy(col)
    .count()
    .sort(F.col('count').desc())
    .show()
)

+------------+-------+
|Payment_type|  count|
+------------+-------+
|           1|3872073|
|           2|2911981|
|         9.8|2041935|
|        10.3|2030392|
|         9.3|2007399|
|        10.8|1987726|
|        11.3|1914398|
|         8.8|1909840|
|        11.8|1833901|
|        12.3|1732740|
|         8.3|1655049|
|        12.8|1647574|
|        13.3|1554777|
|        13.8|1443679|
|       12.36|1374202|
|       11.76|1355676|
|        14.3|1345537|
|         7.8|1329927|
|       12.96|1329377|
|        15.3|1316039|
+------------+-------+
only showing top 20 rows



In [42]:
df_validated_payment = (
    df_validated_store
    .withColumn(col, 
                F.when(~F.col(col).isin(valid_values), None)
                .otherwise(F.col(col)))
)

In [43]:
(
    df_validated_payment
    .groupBy(col)
    .count()
    .sort(F.col('count').desc())
    .show()
)

+------------+---------+
|Payment_type|    count|
+------------+---------+
|        null|109962876|
|           1|  3872073|
|           2|  2911981|
|           3|    37207|
|           6|    23614|
|           4|    14899|
|           5|     2969|
+------------+---------+



### Validation: `RatecodeID`

There are codes other than the defined 1, 2, 3, 4, 5, 6 in the data dictionary. Assume that values between 1 - 1.9999 are considered to be 1, similarly for the other integers.

In [44]:
col = 'RatecodeID'
valid_values = [1, 2, '2.00', 3, 4, 5, 6]

In [45]:
(
    df_validated_payment
    .groupBy(col)
    .count()
    .sort(F.col(col))
    .show(100)
)

+----------+------+
|RatecodeID| count|
+----------+------+
|      null|942199|
|      -.01|    14|
|      -.02|     4|
|      -.03|     6|
|      -.04|     2|
|      -.05|     8|
|      -.06|     4|
|      -.07|     6|
|      -.08|     6|
|      -.09|     4|
|      -.10|     4|
|      -.11|     6|
|      -.12|     1|
|      -.13|     4|
|      -.14|     4|
|      -.15|     7|
|      -.16|     5|
|      -.17|     3|
|      -.18|     2|
|      -.19|     7|
|      -.20|     8|
|      -.21|     2|
|      -.22|     9|
|      -.23|     4|
|      -.24|     6|
|      -.25|     5|
|      -.26|     3|
|      -.27|     6|
|      -.28|     3|
|      -.29|     3|
|      -.30|     6|
|      -.32|     5|
|      -.33|     3|
|      -.34|     5|
|      -.35|     8|
|      -.36|     9|
|      -.37|     6|
|      -.38|     3|
|      -.39|     5|
|      -.40|     4|
|      -.41|     7|
|      -.42|     5|
|      -.43|     5|
|      -.44|     8|
|      -.45|     5|
|      -.46|     6|
|      -.47|     5|


In [46]:
df_validated_ratecode = (
    df_validated_payment
    .withColumn(col, when(F.col(col).between(1, 1.9999), 1)
                     .when(F.col(col).between(2, 2.9999), 2)
                     .when(F.col(col).between(3, 3.9999), 3)
                     .when(F.col(col).between(4, 4.9999), 4)
                     .when(F.col(col).between(5, 5.9999), 5)
                     .when(F.col(col).between(6, 6.9999), 6)
                     .otherwise(None))
)

In [47]:
(
    df_validated_ratecode
    .groupBy(col)
    .count()
    .sort(F.col(col))
    .show()
)

+----------+--------+
|RatecodeID|   count|
+----------+--------+
|      null|39452839|
|         1|43148918|
|         2|16840024|
|         3| 7959398|
|         4| 4309732|
|         5| 3115976|
|         6| 1998732|
+----------+--------+



## Validate: `Total_amount`

`Total_amount` should be >= 0.

In [48]:
col = 'Total_amount'

In [49]:
(
    df_validated_ratecode
    .select(col)
    .describe()
    .show()
)

+-------+------------------+
|summary|      Total_amount|
+-------+------------------+
|  count|         116825619|
|   mean|1.4708037196553179|
| stddev| 5.802119768237885|
|    min|            -890.3|
|    max|            4012.3|
+-------+------------------+



In [50]:
df_validated_amount = (
    df_validated_payment
    .withColumn(col, F.abs(F.col(col)))
)

## Validate: `Fare_amount`

`Fare_amount` should be >= 0.

In [51]:
col = 'Fare_amount'

In [52]:
(
    df_validated_amount
    .select(col)
    .describe()
    .show()
)

+-------+-----------------+
|summary|      Fare_amount|
+-------+-----------------+
|  count|        115769450|
|   mean|2.208597679170043|
| stddev|4.896087930682976|
|    min|           -890.0|
|    max|           4011.5|
+-------+-----------------+



In [53]:
df_validated_fare = (
    df_validated_amount
    .withColumn(col, F.abs(F.col(col)))
)

## Validate: `pickup_datetime`

The data should only contain pickups that occur in years 2019 and 2020. Any rows where the year is not 2019 or 2020 will be dropped.

In [54]:
(
    df_validated_fare
    .withColumn('pickup_year', F.year('pickup_datetime'))
    .filter(~F.col('pickup_year').isin([2019, 2020]))
    .count()
)

1430

There are 1430 rows that are not in 2019 or 2020.

In [55]:
df_validated_pickup_year = (
    df_validated_fare
    .withColumn('pickup_year', F.year('pickup_datetime'))
    .filter(F.col('pickup_year').isin([2019, 2020]))
)

## Validate: `dropoff_datetime`

The data should only contain pickups that occur in years 2019 and 2020. Any rows where the year is not 2019 or 2020 will be dropped.

There are 197 rows that are not in 2019 or 2020.

In [56]:
df_validated_dropoff_year = (
    df_validated_pickup_year
    .withColumn('dropoff_year', F.year('pickup_datetime'))
    .filter(F.col('dropoff_year').isin([2019, 2020]))
)

## Validate: trip duration

The the trip duration is the difference between the `dropoff_datetime` and `pickup_datetime`. The `trip_duration` is in seconds and should be positive.

In [57]:
df_add_trip_duration = (
    df_validated_dropoff_year
    .withColumn('trip_duration', 
                F.col('dropoff_datetime').cast('long') - 
                F.col('pickup_datetime').cast('long'))
)

Assume that trips last at least for one minute, hence drop rows with `trip_duration` < 60.

In [58]:
df_validated_trip_duration = df_add_trip_duration.filter(F.col('trip_duration') > 60)

In [59]:
df_validated_trip_duration.count()

115511074

## Validate `trip_distance`

### Negative `trip_distance`

The `trip_distance` should be > 0. However, there are trips with distance < 0.

These values will be replaced by it's positive distances.

In [60]:
df_validated_trip_duration.filter(F.col('trip_distance') <= 0).count()

133427

In [61]:
df_validated_trip_duration.filter(F.col('trip_distance') == 0).count()

113963

In [62]:
col = 'trip_distance'
df_validated_trip_distance = (
    df_validated_trip_duration
    .withColumn(col, F.abs(F.col(col)))
 )

### Large `trip_distance`s

There are extremely large distances. Only distances up to three standard deviations larger than the mean will be included, any values higher will be regarded as outliers and will be dropped.

There are only 202 trips that are greater than three standard deviations larger than the mean.

In [63]:
df_validated_trip_distance.select('trip_distance').summary().show()

+-------+------------------+
|summary|     trip_distance|
+-------+------------------+
|  count|         115511074|
|   mean|150.91895070136545|
| stddev| 166.4577703229031|
|    min|               0.0|
|    25%|              90.0|
|    50%|             161.0|
|    75%|             232.0|
|    max|         205654.12|
+-------+------------------+



In [64]:
mean = df_validated_trip_distance.select(F.mean(F.col('trip_distance'))).collect()[0][0]
mean

150.91895070136545

In [65]:
stddev = df_validated_trip_distance.select(F.stddev(F.col('trip_distance'))).collect()[0][0]
stddev

166.4577703229031

In [66]:
threshold = mean + 3 * stddev
threshold

650.2922616700747

In [67]:
(
    df_validated_trip_distance
    .filter(F.col('trip_distance') > threshold)
    .count() 
)

344

In [68]:
df_validated_exclude_large_trip = (
    df_validated_trip_distance
    .filter(F.col('trip_distance') <= threshold)
)

## Validate: `Passenger_count`

The domain for this column is [1, 4]. However, there may have been larger taxis that can take up to 9 passengers. Having 0 passengers is beyong my understanding at the moment.  This, along with the nulls could be subject to an imputation strategy in the modelling stage.

In [69]:
col = 'Passenger_count'
valid_values = [1, 2, 3, 4, 5]

In [70]:
(
    df_validated_exclude_large_trip
    .groupBy(col)
    .count()
    .sort(F.col(col))
    .show()
)

+---------------+-------+
|Passenger_count|  count|
+---------------+-------+
|           null| 936032|
|              0|  13368|
|              1|5770149|
|              2| 485748|
|              3|  94673|
|              4| 209552|
|              5| 192596|
|              6| 107799|
|              7| 147881|
|              8|   1452|
|              9|   2613|
|             10|  38210|
|             11|   2336|
|             12|  42494|
|             13| 886922|
|             14|  12133|
|             15|   2299|
|             16|   3740|
|             17|  23843|
|             18|   5980|
+---------------+-------+
only showing top 20 rows



In [73]:
df_passenger = (
    df_validated_exclude_large_trip
    .withColumn(col, when(~F.col(col).isin(valid_values), None)
                     .otherwise(F.col(col)))
)

In [74]:
(
    df_passenger
    .groupBy(col)
    .count()
    .sort(F.col(col))
    .show()
)

+---------------+---------+
|Passenger_count|    count|
+---------------+---------+
|           null|108758012|
|              1|  5770149|
|              2|   485748|
|              3|    94673|
|              4|   209552|
|              5|   192596|
+---------------+---------+



# Feature engineering

## Convert `trip_distance` to km

In [75]:
conversion_factor = 1.60934

In [76]:
df_fe_km = df_passenger.withColumn('trip_distance_km', F.col('trip_distance') * conversion_factor)

## Calculate speed

The average speed (km/h) is calculated by dividing the `trip_distance_km` by the `trip_duration` (seconds) multiplied by 3,600.

There is an inconsistency:
1. either the given unit for `trip_distance` in the data dictionary is incorrect, or
1. the `trip_distance` values are all incorrect, or
1. the `pickup_datetime` and `dropoff_datetime` are incorrect

The `trip_distance`s are reported to be in the unit of miles, however, even with very short `trip_duration`s (less than one hour) the `trip_distance` is often in the hundreds; it's unlikely that a NYC taxi would be able to travel 170 miles in 30 minutes.There is an inconsistency:
1. either the given unit for `trip_distance` in the data dictionary is incorrect, or
1. the `trip_distance` values are all incorrect, or
1. the `pickup_datetime` and `dropoff_datetime` are incorrect

The `trip_distance`s are reported to be in the unit of miles, however, even with very short `trip_duration`s (less than one hour) the `trip_distance` is often in the hundreds; it's unlikely that a NYC taxi would be able to travel 170 miles in 30 minutes.

In [77]:
(
    df_fe_km
    .withColumn('trip_duration_hours', F.col('trip_duration') / 3600)
    .select('dropoff_datetime', 'pickup_datetime', 'trip_duration_hours', 'trip_distance', 'trip_distance_km')
    .sample(fraction=0.0001)
    .limit(10)
    .show()
)

+-------------------+-------------------+-------------------+-------------+------------------+
|   dropoff_datetime|    pickup_datetime|trip_duration_hours|trip_distance|  trip_distance_km|
+-------------------+-------------------+-------------------+-------------+------------------+
|2019-03-20 14:20:56|2019-03-20 14:14:12|0.11222222222222222|         90.0|          144.8406|
|2019-01-01 22:14:49|2019-01-01 22:04:41| 0.1688888888888889|        107.0|         172.19938|
|2019-03-19 16:44:31|2019-03-19 16:04:58| 0.6591666666666667|         88.0|         141.62192|
|2019-01-16 18:24:51|2019-01-16 18:17:27|0.12333333333333334|        142.0|228.52627999999999|
|2019-01-12 12:06:42|2019-01-12 11:50:23|0.27194444444444443|         68.0|         109.43512|
|2019-03-22 19:37:41|2019-03-22 19:32:01|0.09444444444444444|        237.0|         381.41358|
|2019-01-07 08:04:19|2019-01-07 07:48:28|0.26416666666666666|        209.0|         336.35206|
|2019-01-02 10:03:43|2019-01-02 09:43:54| 0.330277

In [78]:
df_speed = (
    df_fe_km
    .withColumn('speed', F.col('trip_distance_km') / F.col('trip_duration') * 3600)
)

In [79]:
df_speed.select('speed', 'trip_duration', 'trip_distance').describe().show()

+-------+------------------+------------------+------------------+
|summary|             speed|     trip_duration|     trip_distance|
+-------+------------------+------------------+------------------+
|  count|         115510730|         115510730|         115510730|
|   mean| 1787.342224620523|1076.5550598026693|150.71781846500346|
| stddev|2002.6832573743222|4252.4727955137105| 78.44062456735868|
|    min|               0.0|                61|               0.0|
|    max|25169.022295081966|           2618881|             603.8|
+-------+------------------+------------------+------------------+



In [80]:
speed_limit = 80

df_speed.filter(F.col('speed') <= speed_limit).count()

9989223

In [81]:
df_speed.filter(F.col('speed') <= speed_limit).count() / df_speed.count()

0.08647874530790343

Only about 10 million rows (~9%) of the data have speeds under 80kph.

### Fixing `trip_distance_km`

According to [careertrend.com](https://careertrend.com/how-many-miles-does-an-average-taxi-cab-driver-drive-yearly-13658842.html), the average taxi trip in the US is 5 miles, which is consistent with the average `trip_duration` of 15 minutes and the speed limit of 25mph in NYC. The `trip_distance` in the data is likely to be incorrect.

1. The average speed for those in with `speed` <= 80kph (50mph) will be calculated
2. This average speed will be used to recalculate the distances based on the `trip_duration`

In [82]:
average_speed = (
    df_speed
    .filter(F.col('speed') <= speed_limit)
    .select(F.mean(F.col('speed')))
    .collect()[0][0]
)

average_speed

24.355786383930308

For rows with speeds less than 80kph, the average speed is 25kph. This is a more reasonable value considering the dense traffic in NYC.

In [83]:
df_fixed_distance = (
    df_speed
    .withColumn('speed', when(F.col('speed') > speed_limit, average_speed)
                                    .otherwise(F.col('speed')))
    .withColumn('trip_distance_km', F.col('speed') * F.col('trip_duration') / 3600)
)

In [84]:
df_fixed_distance.select('speed', 'trip_duration', 'trip_distance_km').describe().show()

+-------+------------------+------------------+------------------+
|summary|             speed|     trip_duration|  trip_distance_km|
+-------+------------------+------------------+------------------+
|  count|         115510730|         115510730|         115510730|
|   mean|24.355786383296174|1076.5550598026693| 6.526585399908055|
| stddev|4.8149381737625605|4252.4727955137105|14.318990958080045|
|    min|               0.0|                61|               0.0|
|    max| 79.99982927977405|           2618881| 830.2424361743164|
+-------+------------------+------------------+------------------+



# Cleanup

Drop unused columns.

In [85]:
df_cleaned = df_fixed_distance.drop('trip_distance')

In [86]:
df_cleaned.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- Store_and_fwd_flag: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- Passenger_count: integer (nullable = true)
 |-- Fare_amount: float (nullable = true)
 |-- extra: float (nullable = true)
 |-- mta_tax: float (nullable = true)
 |-- tip_amount: float (nullable = true)
 |-- tolls_amount: float (nullable = true)
 |-- improvement_surcharge: float (nullable = true)
 |-- Total_amount: float (nullable = true)
 |-- Payment_type: string (nullable = true)
 |-- congestion_surcharge: float (nullable = true)
 |-- colour: string (nullable = true)
 |-- pickup_year: integer (nullable = true)
 |-- pickup_month: integer (nullable = true)
 |-- pickup_dayofyear: integer (nullable = true)
 |-- pickup_dayofmonth: integer (nullable = true)
 |-- picku

# Save data

The data is saved in the parquet format because it is columnar. Columnar storage is preferred because the types of queries that are used are aggregations per column, such as average total amount per group etc.

In [87]:
path = processed_data_dir.joinpath('df_cleaned').as_posix()
df_cleaned.repartition(numPartitions=n_partitions).write.parquet(path, mode='overwrite')

In [88]:
spark.sparkContext.stop()