# Build Predictive Model(s)

In this workbook, you will read the merged dataset you created previously and you will create pipelines to build a binary classification model to predict wether a trip has a tip or not.

Instructions:

1. Read in your merged dataset
2. Use transformes and encoders to perform feature engineering
3. Split into training and testing
4. Build `LogisticRegression` model(s) and train them using pipelines
5. Evaluate the performance of the model(s) using `BinaryClassificationMetrics`

You are welcome to add as many cells as you need below up until the next section. **You must include comments in your code.**

In [1]:
# Start SparkSession
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HW-5").getOrCreate()

In [2]:
# Test whether the Spark can work
spark

In [3]:
from pyspark import SparkContext, SparkConf
sc    = spark.sparkContext

In [4]:
sc

In [5]:
# Import necessary libraries
from pyspark.sql.functions import UserDefinedFunction
from pyspark.ml import Pipeline, Model
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorIndexer
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import functions
from pyspark.ml.classification import LogisticRegression
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.ml.feature import RFormula
import matplotlib.pyplot as plt
import numpy as np
import datetime

In [6]:
# Load data
nyctaxi = spark.read.parquet("s3://zzzzzzhy0607/merged_data")

In [7]:
# See the data schema
nyctaxi.printSchema()

root
 |-- medallion: string (nullable = true)
 |-- hack_license: string (nullable = true)
 |-- vendor_id: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: float (nullable = true)
 |-- surcharge: float (nullable = true)
 |-- mta_tax: float (nullable = true)
 |-- tip_amount: float (nullable = true)
 |-- tolls_amount: float (nullable = true)
 |-- total_amount: float (nullable = true)
 |-- rate_code: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_time_in_secs: float (nullable = true)
 |-- trip_distance: float (nullable = true)
 |-- pickup_longitude: float (nullable = true)
 |-- pickup_latitude: float (nullable = true)
 |-- dropoff_longitude: float (nullable = true)
 |-- dropoff_latitude: float (nullable = true)



Since we want to build a binary classification model to predict whether a trip has a tip or not. We need to add a variable to indicate whether or not there was a tip. We created a column called `tipped`. It is 0 if the tip is 0, otherwise it is 1.

In [8]:
# If tipped 1, otherwise 0
nyctaxi=nyctaxi.withColumn("tipped", when(nyctaxi["tip_amount"]>0.0, 1).otherwise(0))

We also want to create other variables using `pickup_datetime`. Since the numerical time is not useful to interpret the results. We want to create the following variables:
* `pickup_hour` from `pickup_datetime`, which indicates the hour of a day
* `weekday` from `pickup_datetime`, which indicates the day of a week

In [9]:
# Create other variables
nyctaxi=nyctaxi.withColumn("pickup_hour", hour(col("pickup_datetime")))
nyctaxi=nyctaxi.withColumn("weekday", date_format("pickup_datetime",'EEEE'))

Using the variable `pickup_hour`, create a categorical variable to indicate whether the pick up time is in rush hour 
* If the value of the pickup hour is at-or-before 6am, or at-or-after 8pm, then the value is "night"
* If the value of the pickup hour is between 7am and 10am (inclusive), then the value is "am_rush"
* If the value of the pickup hour is between 11am and 3pm (inclusive), then the value is "afternoon"
* If the value of the pickup hour is between 4pm and 7pm (inclusive), then the value is "pm_rush"

In [10]:
nyctaxi.createOrReplaceTempView("nyctaxi_table")
nyctaxi = spark.sql("""SELECT *,
CASE WHEN pickup_hour <= 6 OR pickup_hour >= 20 THEN 'night'
     WHEN pickup_hour >= 7 AND pickup_hour <= 10 THEN 'am_rush'
     WHEN pickup_hour >= 11 AND pickup_hour <= 15 THEN 'afternoon'
     ELSE 'pm_rush' END AS time_bins
FROM nyctaxi_table""")

In [11]:
# Check the schema now
nyctaxi.printSchema()

root
 |-- medallion: string (nullable = true)
 |-- hack_license: string (nullable = true)
 |-- vendor_id: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: float (nullable = true)
 |-- surcharge: float (nullable = true)
 |-- mta_tax: float (nullable = true)
 |-- tip_amount: float (nullable = true)
 |-- tolls_amount: float (nullable = true)
 |-- total_amount: float (nullable = true)
 |-- rate_code: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_time_in_secs: float (nullable = true)
 |-- trip_distance: float (nullable = true)
 |-- pickup_longitude: float (nullable = true)
 |-- pickup_latitude: float (nullable = true)
 |-- dropoff_longitude: float (nullable = true)
 |-- dropoff_latitude: float (nullable = true)
 |-- tipped: integer (nullable = false)
 |-- pickup_hour:

In [12]:
# To build the predictive model, we need to convert all the string fields to numeric ones 
# Use the StringIndexer transformer
# take data in, and produce new data
string_vendor = StringIndexer(inputCol="vendor_id", outputCol="vendor_X")
string_rate = StringIndexer(inputCol="rate_code", outputCol="rate_X")
string_payment = StringIndexer(inputCol="payment_type", outputCol="payment_X")
string_time = StringIndexer(inputCol="time_bins", outputCol="time_bins_X")

In [13]:
# Use OneHotEncoder 
# Take data in and produce a transformer
# Convert the index to a vector of dummy variables.
encoder_vendor = OneHotEncoder(inputCol="vendor_X", outputCol="vendor_vec", dropLast=False)
encoder_rate = OneHotEncoder(inputCol="rate_X", outputCol="rate_vec", dropLast=False)
encoder_payment = OneHotEncoder(inputCol="payment_X", outputCol="payment_vec", dropLast=False)
encoder_time = OneHotEncoder(inputCol="time_bins_X", outputCol="time_bins_vec", dropLast=False)

In [14]:
# Build a pipeline called nyc_final
# Run a fit method and a transform method to get the desired results.
nyc_final = Pipeline(stages=[string_vendor, encoder_vendor, string_rate, encoder_rate, 
                             string_payment, encoder_payment, string_time, encoder_time
                            ]).fit(nyctaxi).transform(nyctaxi)

In [15]:
# Show the first 10 rows of the final datatset
nyc_final.show(10)

+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------+-----------+---------+---------+--------+-------------+------+--------------+---------+-------------+-----------+-------------+
|           medallion|        hack_license|vendor_id|    pickup_datetime|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|rate_code|store_and_fwd_flag|   dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|tipped|pickup_hour|  weekday|time_bins|vendor_X|   vendor_vec|rate_X|      rate_vec|payment_X|  payment_vec|time_bins_X|time_bins_vec|
+--------------------+--------------------+---------+-------------------+------------+--------

In [16]:
# Split into training and testing
splitted_data = nyc_final.randomSplit([0.8, 0.2], 24)
train_data = splitted_data[0]
test_data = splitted_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

Number of training records: 138553296
Number of testing records : 34631795


In [17]:
# Cache the two datasets
train_data.cache()
test_data.cache()

DataFrame[medallion: string, hack_license: string, vendor_id: string, pickup_datetime: timestamp, payment_type: string, fare_amount: float, surcharge: float, mta_tax: float, tip_amount: float, tolls_amount: float, total_amount: float, rate_code: int, store_and_fwd_flag: string, dropoff_datetime: timestamp, passenger_count: int, trip_time_in_secs: float, trip_distance: float, pickup_longitude: float, pickup_latitude: float, dropoff_longitude: float, dropoff_latitude: float, tipped: int, pickup_hour: int, weekday: string, time_bins: string, vendor_X: double, vendor_vec: vector, rate_X: double, rate_vec: vector, payment_X: double, payment_vec: vector, time_bins_X: double, time_bins_vec: vector]

In [18]:
# Build LogisticRegression model
log_reg = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

In [19]:
# Using Pickup_hour, Passenger count, Trip time, Trip distance
# Fare amount, Vendor id, Payment type, Rate code, Time bins
class_formula = RFormula(formula="tipped ~ pickup_hour + passenger_count + trip_time_in_secs + trip_distance + fare_amount + vendor_X + rate_X + payment_X + time_bins_X")

In [20]:
# Build model using pipline
model = Pipeline(stages=[class_formula, log_reg]).fit(train_data)

In [21]:
# Make predictions on the test dataset
predictions = model.transform(test_data)

In [22]:
# Convert the prediction results into rdd
predictions_and_labels = predictions['label', 'prediction'].rdd

In [23]:
# Build metrics on the rdd 
metrics = BinaryClassificationMetrics(predictions_and_labels)

## In the following cells, please provide the requested code and output. Do not change the order and/or structure of the cells.

In the following cell, print the Area Under the Curve (AUC) for your binary classifier.

In [24]:
print("Area under ROC = %s" % metrics.areaUnderROC)

Area under ROC = 0.9832633696283052


In the following cell, provide the code that saves your model your S3 bucket.

In [25]:
# Save the model to S3
model.save("s3://zzzzzzhy0607/hw5_model/")

In [26]:
sc.stop()