# Introduction to End-to-End RAPIDS Workflows

This tutorial will teach developers how to build an end-to-end workflow with cuDF, cuML, and accelerated XGBoost. You will have the chance to ingest data, conduct ETL, perform EDA, train an XGBoost model, and use SHAP to gain insights into the predictions made by the model. 


We're going to be working with data from the [CitiBike data set](https://console.cloud.google.com/marketplace/product/city-of-new-york/nyc-citi-bike?pli=1&project=nv-ai-infra). CitiBike is a bike rental company which operates in NYC. Bikes are 'stored' at docking stations around the city, and users can rent a bike and return it to any docking station. We will use the historical information to attempt to predict the duration of a user's ride, given their starting station, as well as some other information. 


Before we begin, we're going to check what kind of GPU we have using [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface). `nvidia-smi` has a whole range of functions described at the link. We are just going to use it to see general information about our GPU.

In [None]:
!nvidia-smi

Here we see that we have a [16 GB card](https://www.google.com/search?q=mib+to+gb&ei=3s07YsmeELHt9AP4z5zgDg&ved=0ahUKEwjJhbPQ0N32AhWxNn0KHfgnB-wQ4dUDCA4&uact=5&oq=mib+to+gb&gs_lcp=Cgdnd3Mtd2l6EAMyBwgAEEcQsAMyBwgAEEcQsAMyBwgAEEcQsAMyBwgAEEcQsAMyBwgAEEcQsAMyBwgAEEcQsAMyBwgAEEcQsAMyBwgAEEcQsAMyBwgAELADEEMyBwgAELADEEMyBwgAELADEEMyBwgAELADEEMyCggAEOQCELADGAEyCggAEOQCELADGAEyCggAEOQCELADGAEyDwguENQCEMgDELADEEMYAjIPCC4Q1AIQyAMQsAMQQxgCSgQIQRgASgQIRhgBUABYAGCABmgBcAF4AIABAIgBAJIBAJgBAMgBEcABAdoBBggBEAEYCdoBBggCEAEYCA&sclient=gws-wiz). If we had multiple cards, we would use `dask_cudf`. This will be covered in another notebook. 
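
`nvidia-smi` reports memory in MiB, so a quick conversion helps when reading its output. The following is a minimal sketch of that arithmetic; the helper names are hypothetical and the 16384 MiB figure is an illustrative value, not the reading from this notebook's GPU.

```python
def mib_to_gib(mib):
    """Convert mebibytes to gibibytes (1 GiB = 1024 MiB)."""
    return mib / 1024


def mib_to_gb(mib):
    """Convert mebibytes to decimal gigabytes (1 MiB = 1024**2 bytes, 1 GB = 10**9 bytes)."""
    return mib * 1024**2 / 10**9


# Illustrative value only: a card that reports 16384 MiB of memory.
print(mib_to_gib(16384))           # 16.0
print(round(mib_to_gb(16384), 2))  # 17.18
```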

## Importing the data

Before we begin, we need to install a couple of packages.


In [None]:
!pip install google-cloud-bigquery

In [None]:
!pip install db_dtypes

The CitiBike data is available for download directly from BigQuery. In the following cell, we import the data from 2014 only.


_You can change which years, and how many, are imported by altering the `WHERE` clause in the cell below._

In [None]:
import os
import time
import cupy as cp
import cudf 
from google.cloud import bigquery

#os.environ.setdefault("GCLOUD_PROJECT", "hotornot-1078")

query = """
SELECT * 
FROM `bigquery-public-data.new_york_citibike.citibike_trips` 
WHERE EXTRACT(YEAR from starttime) = 2014
"""
client = bigquery.Client()
job = client.query(query)
pd_df = job.to_dataframe()
df = cudf.from_pandas(pd_df)
del(pd_df)
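
If you want to pull several years at once, one option is to generate the `WHERE` clause from a list of years. The sketch below assumes a hypothetical helper, `build_query`, that is not part of the original notebook; it only builds the SQL string and does not contact BigQuery.

```python
def build_query(years):
    """Return a BigQuery SQL string selecting trips that started in the given years."""
    if not years:
        raise ValueError("at least one year is required")
    # Cast to int so that only plain numbers are ever interpolated into the SQL.
    year_list = ", ".join(str(int(y)) for y in years)
    return (
        "SELECT * "
        "FROM `bigquery-public-data.new_york_citibike.citibike_trips` "
        f"WHERE EXTRACT(YEAR FROM starttime) IN ({year_list})"
    )


print(build_query([2014, 2015]))
```

The resulting string could be passed to `client.query(...)` in place of `query`. Keep in mind that each extra year increases both the download time and the GPU memory needed.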

Let's look at the data. 

In [None]:
df.head()

Let's see the data types we have from the import.

In [None]:
df.dtypes

Let's take a quick look at summary statistics. Because the DataFrame holds mixed types, we pass `include='all'` so that `df.describe()` summarizes every column rather than only the numeric ones.

In [None]:
df.describe(include='all')

## Data Cleaning and Feature Engineering

The data contains some redundant information - `start_station_id` and `end_station_id` are both captured by the station names and latitude/longitude data. We drop this redundant information. 

We also remove all information about the end station. We wish to predict the duration of the user's ride at the point of pick up, and their bike drop-off destination would not be known to us at that time. 

We don't expect `bikeid` to give us insight into ride duration, so we remove that column from the data set.

We drop information based on `tripduration`, starting with observations where `tripduration` is negative. Bikes can do a lot of things, but they can't travel back in time!

We remove any trips lasting five minutes or less, as these are likely to indicate a malfunctioning bike which is quickly returned, rather than a real journey.

We also drop all rides that lasted longer than 10 hours from our data. Citi Bikes are meant for relatively short trips around the city and are not suitable for long journeys, so we don't want these rides to skew our model.

Finally, we drop all recorded rides that contain missing data for any of the remaining columns.

In [None]:
df = df.drop(['start_station_id', 'end_station_id', 'end_station_name', 'bikeid', 'stoptime', 'end_station_latitude', 'end_station_longitude'], axis=1) 


In [None]:
df['tripduration'] = df['tripduration'].where(df['tripduration']>300)
df['tripduration'] = df['tripduration'].where(df['tripduration']<=36000)
df = df.dropna()
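
The two `where` calls replace out-of-range durations with nulls, and `dropna` then removes those rows. As a CPU-only sketch of the same bounds in plain Python (the function name and sample values are made up for illustration):

```python
def clean_durations(durations, lower=300, upper=36000):
    """Keep durations (in seconds) with lower < d <= upper; drop missing values."""
    return [d for d in durations if d is not None and lower < d <= upper]


# Toy values: negative, too short, boundary cases, too long, and missing.
sample = [-5, 120, 300, 301, 1800, 36000, 36001, None]
print(clean_durations(sample))  # [301, 1800, 36000]
```

Note that a single lower bound of 300 seconds also removes the negative durations. In the real DataFrame, `dropna` additionally removes rows with missing values in any other column.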

Next, we extract some features from the time fields that will be useful for our model. We're going to create variables for the hour of the day, the day of the week, and the month, and then drop the original time variables. The exact second a bike was rented or returned likely has limited explanatory value.

In [None]:
df['start_hour_of_the_day'] = df['starttime'].dt.hour
df['dow'] = df['starttime'].dt.dayofweek
df['month'] = df['starttime'].dt.month
df = df.drop(['starttime'], axis=1)
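
To make the encoding of these features concrete, here is a small sketch using Python's standard `datetime` module rather than cuDF. Like `.dt.dayofweek`, `datetime.weekday()` numbers the days from Monday = 0 to Sunday = 6. The timestamp is an arbitrary example, not a row from the data set.

```python
from datetime import datetime


def time_features(start):
    """Extract the same three features the notebook derives from `starttime`."""
    return {
        "start_hour_of_the_day": start.hour,  # 0-23
        "dow": start.weekday(),               # Monday = 0 ... Sunday = 6
        "month": start.month,                 # 1-12
    }


# Friday, 4 July 2014, 17:30
print(time_features(datetime(2014, 7, 4, 17, 30)))
# {'start_hour_of_the_day': 17, 'dow': 4, 'month': 7}
```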


We're going to use cuML for the next bit of ETL to encode labels into numbers for our analysis. 

In [None]:
import cuml

le = cuml.LabelEncoder()
df['start_station_name'] = le.fit_transform(df['start_station_name'])
df['usertype'] = le.fit_transform(df['usertype'])
df['gender'] = le.fit_transform(df['gender'])
df['customer_plan'] = le.fit_transform(df['customer_plan'])
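
Label encoding simply maps each distinct category to an integer. The sketch below shows the idea in plain Python, assuming labels are numbered in sorted order as scikit-learn's encoder does; cuML's internal ordering may differ, and the function here is illustrative only.

```python
def fit_transform(values):
    """Map each distinct label to an integer code and return (codes, classes)."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes


codes, classes = fit_transform(["Subscriber", "Customer", "Subscriber"])
print(codes)    # [1, 0, 1]
print(classes)  # ['Customer', 'Subscriber']
```

Because `fit_transform` refits the encoder each time, reusing one `le` object for several columns keeps only the last column's mapping. If you need to map codes back to the original labels later, keep a separate encoder per column.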

Since we are aiming to predict the length of the ride in seconds, it would be unfair to include both the time at which the journey starts and the time at which it stops in our feature vectors. We already dropped `stoptime` above, so let's check the remaining columns and then see how well we can predict trip duration.

In [None]:
df.head()

In [None]:
df.shape

Now that our data is cleaned up, it's time to see how well we can predict trip duration. We'll do this with a simple XGBoost model.

## XGBoost Prediction Model

First, we want to split our data into train and test sets. We do this with cuML. 

In [None]:
X_train, X_test, y_train, y_test = cuml.train_test_split(df, 'tripduration', train_size=0.8)
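
Conceptually, an 80/20 split shuffles the rows and cuts them at the 80% mark. Here is a minimal plain-Python sketch of that idea, assuming a shuffled split; `split_rows` and its `seed` argument are hypothetical and this is not how cuML implements it.

```python
import random


def split_rows(rows, train_size=0.8, seed=0):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


train, test = split_rows(range(10))
print(len(train), len(test))  # 8 2
```

Fixing the seed makes the split repeatable, which helps when comparing models trained with different parameters.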

In [None]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test) 

In [None]:
params = {
    'learning_rate': 0.01,
    'max_depth': 5,
    'objective': 'reg:squarederror',
    'subsample': 0.8,
    'disable_default_eval_metric':True, 
    'tree_method':'gpu_hist' 
}

trained_model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train')]
)

We'll save the trained model, so that we can re-load it later.

In [None]:
trained_model.save_model("xgb.model")

Now let's see how well our fitted model performs on the held-out test data.

In [None]:
prediction = trained_model.predict(dtest).astype('int64')
print("RMSE: {}".format(cp.sqrt(cuml.metrics.mean_squared_error(y_test.values, prediction))))

It looks like our quick model's predictions are off by about 13 minutes. Why not see if you can change the parameter values and improve the model's performance?
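
Since `tripduration` is measured in seconds, the RMSE is in seconds too, and dividing by 60 gives minutes. The following sketch spells out the calculation in plain Python with made-up values, not results from the model above.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: both predictions are off by exactly 60 seconds.
error_seconds = rmse([600, 900], [660, 840])
print(error_seconds, error_seconds / 60)  # 60.0 1.0
```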

## Model Explainability with SHAP

When using complex models, such as XGBoost, it's not always straightforward to understand the predictions made by the model. In this section we use SHapley Additive exPlanations (SHAP) values to gain insight into the machine learning model.

Computing SHAP values is a computationally expensive procedure, but we accelerate the procedure by running on NVIDIA GPUs. To save more time, we compute SHAP values on a subset of our data.

Much of the code in this section is taken from this great [blog](https://medium.com/rapids-ai/gpu-accelerated-shap-values-with-xgboost-1-3-and-rapids-587fad6822) on GPU-Accelerated SHAP Values. 

In [None]:
shap_sample = xgb.DMatrix(X_test.sample(frac=0.01))

In [None]:
%%time
trained_model.set_param({"predictor": "gpu_predictor"})
shap_values = trained_model.predict(shap_sample, pred_contribs=True)

We can aggregate and visualise these SHAP values to see which of the features in our data had the most impact on the predictions made by our model.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_importance(feature_names, shap_values):
    # Get the mean absolute contribution for each feature
    # (the last column holds the bias term, so it is excluded)
    aggregate = np.mean(np.abs(shap_values[:, 0:-1]), axis=0)
    # sort by magnitude
    z = [(x, y) for y, x in sorted(zip(aggregate, feature_names), reverse=True)]
    z = list(zip(*z))
    plt.bar(z[0], z[1])
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()


plot_feature_importance(X_test.columns, shap_values)

This shows us that the most important features in predicting ride duration are those describing the location of the pick-up point.
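
The ranking in the plot comes from averaging the absolute SHAP value of each feature over all sampled rows, ignoring the trailing bias column that `pred_contribs=True` appends. A plain-Python sketch of that aggregation, using a toy matrix and hypothetical feature names:

```python
def mean_abs_contributions(feature_names, shap_rows):
    """Rank features by mean absolute SHAP value, ignoring the last (bias) column."""
    n_features = len(feature_names)
    totals = [0.0] * n_features
    for row in shap_rows:
        for i in range(n_features):  # stops before the trailing bias column
            totals[i] += abs(row[i])
    means = [total / len(shap_rows) for total in totals]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)


# Two toy rows with two features each, plus a bias column of 0.5.
toy = [[1.0, -4.0, 0.5], [-3.0, 2.0, 0.5]]
print(mean_abs_contributions(["latitude", "hour"], toy))
# [('hour', 3.0), ('latitude', 2.0)]
```

Taking absolute values before averaging matters: a feature that pushes some predictions up and others down would otherwise appear unimportant.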


We can also use SHAP to consider the importance of interactions between features. This is even more computationally expensive, but can bring valuable insights. The following cell will take around 100 seconds to run.

In [None]:
%%time
shap_interactions = trained_model.predict(shap_sample, pred_interactions=True)

In [None]:
def plot_top_k_interactions(feature_names, shap_interactions, k):
    # Get the mean absolute contribution for each feature interaction
    aggregate_interactions = np.mean(np.abs(shap_interactions[:, :-1, :-1]), axis=0)
    interactions = []
    for i in range(aggregate_interactions.shape[0]):
        for j in range(aggregate_interactions.shape[1]):
            if j < i:
                interactions.append(
                    (feature_names[i] + "-" + feature_names[j], aggregate_interactions[i][j] * 2))
    # sort by magnitude
    interactions.sort(key=lambda x: x[1], reverse=True)
    interaction_features, interaction_values = map(tuple, zip(*interactions))
    plt.bar(interaction_features[:k], interaction_values[:k])
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()


plot_top_k_interactions(X_test.columns, shap_interactions, 10)

Here we see (unsurprisingly) that the interaction between the starting longitude and latitude greatly influences the predictions, followed by the interaction between a location feature and the starting time of the ride.
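
The factor of 2 in the plotting function reflects that the SHAP interaction matrix is symmetric: the effect of a pair is split evenly between entries `[i][j]` and `[j][i]`, so reading only the lower triangle and doubling it recovers the pair's total. A plain-Python sketch with a single toy matrix and hypothetical feature names:

```python
def top_interactions(feature_names, interaction_rows, k):
    """Return the k strongest pairwise interactions by mean absolute value."""
    n = len(feature_names)  # excludes the trailing bias row and column
    pairs = []
    for i in range(n):
        for j in range(i):  # lower triangle only, so each pair appears once
            mean_abs = sum(abs(m[i][j]) for m in interaction_rows) / len(interaction_rows)
            # Double it because the same amount is stored at [j][i].
            pairs.append((f"{feature_names[i]}-{feature_names[j]}", 2 * mean_abs))
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:k]


# One toy sample: three features plus a bias row and column.
toy = [
    [0.5, 0.2, 0.1, 0.0],
    [0.2, 0.3, 0.4, 0.0],
    [0.1, 0.4, 0.6, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
for name, value in top_interactions(["lat", "lon", "hour"], [toy], 2):
    print(name, value)
```

The diagonal entries hold each feature's main effect, which is why they are left out of the pairwise ranking.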

## Accelerating Inference 

Throughout this notebook we have run most of our computation on the GPU. In this section, we compare the time it takes to make predictions on the CPU versus the GPU.

In [None]:
xgb_features = xgb.DMatrix(X_test.astype("float32"))

### CPU

We first re-load the model from file, as XGBoost caches the results of previous predictions. 

In [None]:
%%time
model = xgb.Booster(model_file="xgb.model")
model.set_param({"predictor": "cpu_predictor"})
predictions = model.predict(xgb_features)

### GPU

Now we reload the model again, and this time run the same predictions on the GPU.

In [None]:
%%time
model = xgb.Booster(model_file="xgb.model")
model.set_param({"predictor": "gpu_predictor"})
predictions = model.predict(xgb_features)

So you can see that the GPU allows us to make predictions in a fraction of the time taken on CPU. This is ideal for situations requiring real-time inference. 
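
A single `%%time` reading can be noisy, so it may be worth repeating the measurement and keeping the best run. The sketch below assumes a hypothetical helper, `best_time`, and times a stand-in workload rather than the model itself.

```python
import time


def best_time(fn, repeats=3):
    """Call fn several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; in the notebook this would be the reload-and-predict step.
elapsed = best_time(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.6f} s")
```

In this notebook, `fn` would need to wrap both reloading the model and calling `predict`, so that cached predictions do not flatter the later runs.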

## Conclusion

In this notebook you've seen how we can use cuML, cuDF and XGBoost to explore and clean data, compute feature vectors and train a machine learning model to predict ride duration on the CitiBike Data Set. 

To find out more, check out [RAPIDS.ai](http://rapids.ai).