# Predicting NYC Taxi Fares with RAPIDS

[RAPIDS](https://rapids.ai/) is a suite of GPU-accelerated data science libraries with APIs that should be familiar to users of Pandas, Dask, and scikit-learn.

This notebook shows how to use cuDF with Dask & XGBoost to scale GPU DataFrame ETL-style operations & model training out to multiple GPUs on multiple nodes as part of Google Cloud Dataproc.

Anaconda has graciously made some of the NYC Taxi dataset available in [a public Google Cloud Storage bucket](https://console.cloud.google.com/storage/browser/anaconda-public-data/nyc-taxi/csv/). We'll use our Dataproc Cluster of GPUs to process it and train a model that predicts the fare amount.

In [None]:
import numpy as np
import numba, xgboost, socket
import dask, dask_cudf
from dask.distributed import Client, wait

# connect to the Dask cluster created at Dataproc startup time
# if you're not using Dataproc, see https://github.com/rapidsai/dask-cuda for help
client = Client(socket.gethostname()+':8786')
# forces workers to restart. useful to ensure GPU memory is clear
client.restart()

# limit work-stealing as much as possible
dask.config.set({'distributed.scheduler.work-stealing': False})
dask.config.get('distributed.scheduler.work-stealing')
dask.config.set({'distributed.scheduler.bandwidth': 1})
dask.config.get('distributed.scheduler.bandwidth')

client

# Inspecting the Data

Now that we have a cluster of GPU workers, we'll use [dask-cudf](https://github.com/rapidsai/dask-cudf/) to load and parse a bunch of CSV files into a distributed DataFrame.

In [None]:
base_path = 'gcs://anaconda-public-data/nyc-taxi/csv/'

df_2014 = dask_cudf.read_csv(base_path+'2014/yellow_*.csv')
df_2014.head().to_pandas()

# Data Cleanup

As usual, the data needs to be massaged a bit before we can start adding features that are useful to an ML model.

For example, in the 2014 taxi CSV files, there are `pickup_datetime` and `dropoff_datetime` columns. The 2015 CSVs have `tpep_pickup_datetime` and `tpep_dropoff_datetime`, which are the same columns. One year has `rate_code`, and another `RateCodeID`.

Also, some CSV files have column names with extraneous spaces in them.

Worst of all, starting in the July 2016 CSVs, pickup & dropoff latitude and longitude data were replaced by location IDs, making the second half of the year useless to us.

We'll do a little string manipulation, column renaming, and concatenating of DataFrames to sidestep the problems.
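
The name clean-up described above can be sketched in plain Python, without cuDF. This is only an illustration under stated assumptions: `normalize_columns` is a hypothetical helper, the header names are invented, and `must_haves` is a shortened stand-in for the full list of required columns.

```python
# Plain-Python sketch of the column clean-up; no cuDF required.
# `normalize_columns` is a hypothetical helper, not part of any library.

def normalize_columns(columns, remap, must_haves):
    """Return the cleaned names of the columns that should be kept, in order."""
    kept = []
    for name in columns:
        cleaned = name.strip().lower()         # drop stray spaces, lowercase
        cleaned = remap.get(cleaned, cleaned)  # apply the rename mapping
        if cleaned in must_haves:              # discard unexpected columns
            kept.append(cleaned)
    return kept

remap = {
    'tpep_pickup_datetime': 'pickup_datetime',
    'tpep_dropoff_datetime': 'dropoff_datetime',
    'ratecodeid': 'rate_code',
}
# shortened stand-in for the full set of required columns
must_haves = {'pickup_datetime', 'dropoff_datetime', 'rate_code', 'fare_amount'}

# header names invented for illustration
raw_header = ['VendorID', ' tpep_pickup_datetime', 'tpep_dropoff_datetime ',
              'RateCodeID', ' fare_amount']
cleaned_header = normalize_columns(raw_header, remap, must_haves)
print(cleaned_header)
```

Lowercasing before the rename is what lets a single `ratecodeid` entry in the mapping cover `RateCodeID` and any other capitalization.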

In [None]:
# mapping of column names that need to be renamed
remap = {}
remap['tpep_pickup_datetime'] = 'pickup_datetime'
remap['tpep_dropoff_datetime'] = 'dropoff_datetime'
remap['ratecodeid'] = 'rate_code'

# dict of columns & dtypes the df must have
must_haves = {
 'pickup_datetime': 'datetime64[ms]',
 'dropoff_datetime': 'datetime64[ms]',
 'passenger_count': 'int32',
 'trip_distance': 'float32',
 'pickup_longitude': 'float32',
 'pickup_latitude': 'float32',
 'rate_code': 'int32',
 'dropoff_longitude': 'float32',
 'dropoff_latitude': 'float32',
 'fare_amount': 'float32'
}

In [None]:
# helper function which takes a DataFrame partition
def clean(df_part, remap, must_haves):    
    # some column names include extraneous spaces; strip them & lowercase the names
    tmp = {col:col.strip().lower() for col in list(df_part.columns)}
    df_part = df_part.rename(tmp)
    
    # rename using the supplied mapping
    df_part = df_part.rename(remap)
    
    # iterate through columns in this df partition
    for col in df_part.columns:
        # drop anything not in our expected list
        if col not in must_haves:
            df_part = df_part.drop(col)
            continue
        
        # fixes datetime error found by Ty Mckercher and fixed by Paul Mahler
        if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:
            df_part[col] = df_part[col].astype('datetime64[ms]')
            continue
                
        # if column was read as a string, recast as float
        if df_part[col].dtype == 'object':
            df_part[col] = df_part[col].str.fillna('-1')
            df_part[col] = df_part[col].astype('float32')
        else:
            # downcast from 64bit to 32bit types
            # Tesla T4 are faster on 32bit ops
            if 'int' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('int32')
            if 'float' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('float32')
            df_part[col] = df_part[col].fillna(-1)
    
    return df_part

In [None]:
df_2014 = df_2014.map_partitions(clean, remap, must_haves)
df_2014.head().to_pandas()

# Increasing Our Training Data Size

We still have 2015 and the first half of 2016's data to read and clean.

In [None]:
df_2015 = dask_cudf.read_csv(base_path+'2015/yellow_*.csv').map_partitions(clean, remap, must_haves)

## Handling 2016's Mid-Year Schema Change

In 2016, only the January - June CSVs have the columns we need. If we try to read `base_path+'2016/yellow_*.csv'`, Dask will not appreciate having differing schemas in the same DataFrame.

Instead, we'll need to create a list of the valid months and read them independently.

In [None]:
months = [str(x).rjust(2, '0') for x in range(1, 7)]
valid_files = [base_path+'2016/yellow_tripdata_2016-'+month+'.csv' for month in months]
valid_files

In [None]:
# read & clean 2016 data and concat all DFs
df_2016 = dask_cudf.read_csv(valid_files).map_partitions(clean, remap, must_haves)

# concatenate multiple DataFrames into one bigger one
taxi_df = dask.dataframe.multi.concat([df_2014, df_2015, df_2016])

In [None]:
# apply a list of filter conditions to throw out records with missing or outlier values
query_frags = [
    'fare_amount > 0 and fare_amount < 500',
    'passenger_count > 0 and passenger_count < 6',
    'pickup_longitude > -75 and pickup_longitude < -73',
    'dropoff_longitude > -75 and dropoff_longitude < -73',
    'pickup_latitude > 40 and pickup_latitude < 42',
    'dropoff_latitude > 40 and dropoff_latitude < 42'
]
taxi_df = taxi_df.query(' and '.join(query_frags))

# inspect the results of cleaning
taxi_df.head().to_pandas()
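
To make the filter concrete, here is a small plain-Python sketch of how the fragments combine into one boolean expression. The sample rows are made up, `keep` is a hypothetical helper, and `eval` merely stands in for the DataFrame `query` engine; it is only reasonable here because the expression is a trusted literal.

```python
# Sketch: join filter fragments into one expression and apply it to dict "rows".
query_frags = [
    'fare_amount > 0 and fare_amount < 500',
    'passenger_count > 0 and passenger_count < 6',
    'pickup_longitude > -75 and pickup_longitude < -73',
]
expression = ' and '.join(query_frags)

def keep(row):
    # evaluate the trusted expression with the row's fields as local names
    return eval(expression, {'__builtins__': {}}, row)

# made-up rows; -1 mimics the placeholder used for missing values during cleaning
rows = [
    {'fare_amount': 12.5, 'passenger_count': 2, 'pickup_longitude': -73.98},
    {'fare_amount': -1.0, 'passenger_count': 1, 'pickup_longitude': -73.95},
    {'fare_amount': 20.0, 'passenger_count': 0, 'pickup_longitude': -74.00},
    {'fare_amount': 9.0, 'passenger_count': 1, 'pickup_longitude': -1.0},
]
kept_rows = [row for row in rows if keep(row)]
print(kept_rows)
```

Because missing values were filled with -1 during cleaning, range checks like these also discard records that had missing fields.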

# Adding Interesting Features

Dask & cuDF provide standard DataFrame operations, but also let you run "user defined functions" on the underlying data.

cuDF's [apply_rows](https://rapidsai.github.io/projects/cudf/en/0.6.0/api.html#cudf.dataframe.DataFrame.apply_rows) operation is similar to Pandas's [DataFrame.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html), except that for cuDF, custom Python code is [JIT compiled by numba](https://numba.pydata.org/numba-doc/dev/cuda/kernels.html) into GPU kernels.

We'll use a Haversine Distance calculation to find total trip distance, and extract additional useful variables from the datetime fields.
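
For reference, the Haversine formula for a single pair of points can be sketched in plain Python on the CPU. `haversine_km` is a hypothetical helper and the coordinates are illustrative; it uses the same mean Earth radius (6371 km) as the notebook's GPU kernel.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371  # mean Earth radius, as used by the notebook's kernel

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# identical points are zero kilometers apart
same_point = haversine_km(40.75, -73.99, 40.75, -73.99)
# one degree of latitude along a meridian is roughly 111 km
one_degree_north = haversine_km(40.0, -74.0, 41.0, -74.0)
print(same_point, round(one_degree_north, 2))
```

Checking a couple of known distances this way is a cheap sanity test before trusting the `h_distance` column produced on the GPU.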

In [None]:
import math
from math import cos, sin, asin, sqrt, pi

def haversine_distance_kernel(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude, h_distance):
    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude)):
        x_1 = pi/180 * x_1
        y_1 = pi/180 * y_1
        x_2 = pi/180 * x_2
        y_2 = pi/180 * y_2
        
        dlon = y_2 - y_1
        dlat = x_2 - x_1
        a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2
        
        c = 2 * asin(sqrt(a)) 
        r = 6371 # Radius of earth in kilometers
        
        h_distance[i] = c * r

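def day_of_the_week_kernel(day, month, year, day_of_week):
    # Zeller's congruence: 0 = Saturday, 1 = Sunday, ..., 6 = Friday
    for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):
        # January & February count as months 13 & 14 of the previous year
        if month[i] <3:
            shift = 12
        else:
            shift = 0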
        Y = year[i] - (month[i] < 3)
        y = Y - 2000
        c = 20
        d = day[i]
        m = month[i] + shift + 1
        day_of_week[i] = (d + math.floor(m*2.6) + y + (y//4) + (c//4) -2*c)%7
        
def add_features(df):
    df['hour'] = df['pickup_datetime'].dt.hour
    df['year'] = df['pickup_datetime'].dt.year
    df['month'] = df['pickup_datetime'].dt.month
    df['day'] = df['pickup_datetime'].dt.day
    df['diff'] = df['dropoff_datetime'].astype('int32') - df['pickup_datetime'].astype('int32')
    
    df['pickup_latitude_r'] = df['pickup_latitude']//.01*.01
    df['pickup_longitude_r'] = df['pickup_longitude']//.01*.01
    df['dropoff_latitude_r'] = df['dropoff_latitude']//.01*.01
    df['dropoff_longitude_r'] = df['dropoff_longitude']//.01*.01
    
    df = df.drop('pickup_datetime')
    df = df.drop('dropoff_datetime')
    
    
    df = df.apply_rows(haversine_distance_kernel,
                   incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],
                   outcols=dict(h_distance=np.float32),
                   kwargs=dict())
    
    
    df = df.apply_rows(day_of_the_week_kernel,
                      incols=['day', 'month', 'year'],
                      outcols=dict(day_of_week=np.float32),
                      kwargs=dict())
    
    
    df['is_weekend'] = (df['day_of_week']<2).astype(np.int32)
    return df

In [None]:
# actually add the features
taxi_df = taxi_df.map_partitions(add_features)
# inspect the result
taxi_df.head().to_pandas()
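
The `day_of_week` feature appears to follow Zeller's congruence, which numbers the days 0 = Saturday, 1 = Sunday, through 6 = Friday; that is why `day_of_week < 2` marks a weekend. Here is a plain-Python sketch of the congruence, cross-checked against the standard library over the date range used in this notebook. `zeller_day_of_week` is a hypothetical helper, not the GPU kernel itself.

```python
import datetime

def zeller_day_of_week(year, month, day):
    """Zeller's congruence (Gregorian): 0 = Saturday, 1 = Sunday, ..., 6 = Friday."""
    if month < 3:
        # January and February count as months 13 and 14 of the previous year
        month += 12
        year -= 1
    k = year % 100   # year within the century
    j = year // 100  # zero-based century
    # + 5 * j is congruent to - 2 * j modulo 7 and keeps the sum non-negative
    return (day + (13 * (month + 1)) // 5 + k + k // 4 + j // 4 + 5 * j) % 7

# cross-check every date from 2014-01-01 through 2016-06-30
mismatches = []
current = datetime.date(2014, 1, 1)
while current <= datetime.date(2016, 6, 30):
    # date.weekday() uses Monday = 0, so shift by two to get Saturday = 0
    expected = (current.weekday() + 2) % 7
    if zeller_day_of_week(current.year, current.month, current.day) != expected:
        mismatches.append(current)
    current += datetime.timedelta(days=1)

print(len(mismatches))
```

A CPU cross-check like this is useful because an off-by-one in the month shift only shows up for January and February dates.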

# Pick a Training Set

Let's imagine you're making a trip to New York on the 25th and want to build a model to predict what fare prices will be like the last few days of the month based on the first part of the month. We'll use a query expression to identify the `day` of the month to use to divide the data into train and test sets.

The wall time below represents how long it takes your GPU cluster to load data from the Google Cloud Storage bucket and run the ETL portion of the workflow.
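
As a rough guide to how much data a `day < 25` split holds out, the sketch below counts calendar days from January 2014 through June 2016. It assumes trips are spread evenly across days, which the dataset does not guarantee, so treat the fraction as an estimate rather than the actual row split.

```python
import calendar

# months covered by the notebook: all of 2014 and 2015, plus January-June 2016
months = [(year, month) for year in (2014, 2015) for month in range(1, 13)]
months += [(2016, month) for month in range(1, 7)]

# calendar.monthrange returns (weekday of first day, number of days in month)
total_days = sum(calendar.monthrange(y, m)[1] for y, m in months)
# days 25 through the end of each month fall in the held-out set
test_days = sum(calendar.monthrange(y, m)[1] - 24 for y, m in months)
test_fraction = test_days / total_days

print(total_days, test_days, round(test_fraction, 3))
```

By this estimate, about a fifth of the calendar days land in the held-out set.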

In [None]:
%%time
X_train = taxi_df.query('day < 25').persist()

# create a Y_train ddf with just the target variable
Y_train = X_train[['fare_amount']].persist()
# drop the target variable from the training ddf
X_train = X_train[X_train.columns.difference(['fare_amount'])]

# this won't return until all data is in GPU memory
done = wait([X_train, Y_train])

In [None]:
len(X_train)

# Train the XGBoost Regression Model

The wall time output below indicates how long it took your GPU cluster to train an XGBoost model over the training set.

In [None]:
%%time

import dask_xgboost as dxgb_gpu

params = {
  'learning_rate': 0.3,
  'max_depth': 8,
  'objective': 'reg:squarederror',
  'subsample': 0.6,
  'gamma': 1,
  'silent': True,
  'verbose_eval': True,
  'tree_method':'gpu_hist',
  'n_gpus': 1
}

trained_model = dxgb_gpu.train(client, params, X_train, Y_train, num_boost_round=100)

# How Good is Our Model?

Now that we have a trained model, we need to test it with the records we held out: trips from the 25th of each month onward, roughly a fifth of the calendar days in the dataset.

Based on the filtering conditions applied to this dataset, many of the DataFrame partitions will wind up having 0 rows.

This is a problem for XGBoost, which doesn't know what to do with zero-length arrays. We'll apply a bit of Dask logic to check for and drop partitions without any rows.
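
The idea can be sketched with plain Python lists standing in for DataFrame partitions: measure each partition, build a boolean mask, and keep only the partitions the mask marks as non-empty. The partition contents are made up for illustration.

```python
# made-up stand-ins for DataFrame partitions; two of them are empty
partitions = [[12.5, 8.0], [], [30.0], []]

lengths = [len(part) for part in partitions]  # rows per partition
nonempty = [length > 0 for length in lengths]  # boolean mask
kept = [part for part, keep in zip(partitions, nonempty) if keep]

print(lengths, kept)
```

Computing the lengths is the costly step on a real cluster, since every partition has to be evaluated before the mask can be built.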

In [None]:
def drop_empty_partitions(df):
    lengths = df.map_partitions(len).compute()
    nonempty = [length > 0 for length in lengths]
    return df.partitions[nonempty]

In [None]:
X_test = taxi_df.query('day >= 25').persist()
X_test = drop_empty_partitions(X_test)

# Create Y_test with just the fare amount
Y_test = X_test[['fare_amount']]

# Drop the fare amount from X_test
X_test = X_test[X_test.columns.difference(['fare_amount'])]

# display test set size
len(X_test)

In [None]:
# generate predictions on the test set
Y_test['prediction'] = dxgb_gpu.predict(client, trained_model, X_test)

# Compute Root Mean Squared Error
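
RMSE is the square root of the mean of the squared differences between predictions and actual values. Here is a minimal plain-Python sketch on made-up fares; `rmse` is a hypothetical helper used only for illustration.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up predicted and actual fares, in dollars
example_rmse = rmse([10.0, 12.0, 9.0], [11.0, 12.0, 7.0])
print(example_rmse)
```

Squaring the errors before averaging means a few large misses weigh more heavily than many small ones.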

In [None]:
Y_test['squared_error'] = (Y_test['prediction'] - Y_test['fare_amount'])**2

# inspect the results to make sure our calculation looks right
Y_test.head().to_pandas()

In [None]:
# compute the actual RMSE over the full test set
math.sqrt(Y_test.squared_error.mean().compute())

Not bad! The model's RMSE on the held-out data is about $1.79.

If I'm planning to head to Strata Data in NYC, I can probably fill out my ground transportation expense items ahead of time.

# Save Trained Model for Later Use

To make a model maximally useful, you need to be able to save it for later use.

We'll use Google Cloud Storage to persist the trained model in a [dill](https://pypi.org/project/dill/) file.

In [None]:
import gcsfs, dill

fs = gcsfs.GCSFileSystem()
# replace with a bucket you own
bucket = 'rapidsai-test-1/'

with fs.open(bucket+'trained_model.dill', 'wb') as file:  
    dill.dump(trained_model, file)

# Reload a Saved Model from Disk

You can also read the saved model back out of Google Cloud Storage and into a normal XGBoost model object.

In [None]:
with fs.open(bucket+'trained_model.dill', 'rb') as file:  
    model_from_disk = dill.load(file)

# Generate predictions on the test set again, but this time using the reloaded model
Y_test['prediction'] = dxgb_gpu.predict(client, model_from_disk, X_test)

# Verify that the predictions result in the same RMSE error
Y_test['squared_error'] = (Y_test['prediction'] - Y_test['fare_amount'])**2
math.sqrt(Y_test.squared_error.mean().compute())
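
The save-and-reload round trip can be rehearsed locally before pointing it at a real bucket. The sketch below uses the standard library's `pickle`, whose `dump`/`load` calls have the same shape as the `dill` calls used in this notebook, and an in-memory buffer in place of a Google Cloud Storage file. The dictionary is only a stand-in for a trained model.

```python
import io
import pickle

# stand-in object; a real run would serialize the trained model instead
model_stand_in = {'max_depth': 8, 'learning_rate': 0.3}

buffer = io.BytesIO()               # in-memory stand-in for a GCS file handle
pickle.dump(model_stand_in, buffer)

buffer.seek(0)                      # rewind before reading the bytes back
restored = pickle.load(buffer)

print(restored == model_stand_in)
```

Comparing predictions from the original and restored objects, as the notebook does with RMSE, is the end-to-end version of this equality check.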

# Takeaways

We just demonstrated how to use GPU DataFrames to scale ETL-style operations out to multiple GPUs on multiple nodes.

We also showed how to pass prepared data directly to XGBoost without having the data ever leave GPU memory. As a result, we can run end-to-end data processing _and_ model training faster, using less hardware than with a CPU-based solution.

While other workflows will be more complex or operate on larger datasets, our hope is that pre-processing and training on approximately 70GB (360 million rows) in about 4 minutes shows that GPUs can offer speedups that give data scientists less time to drink coffee, and more time to iterate on and tune model performance.

What now?

[Check out RAPIDS on GitHub](https://github.com/rapidsai) and follow the development, or pitch in by reporting issues, making pull requests or even just requesting the features your workflows need. We look forward to hearing from you!