# Introduction to XGBoost Spark with GPU

This notebook uses the taxi dataset as an example of an XGBoost regressor. It shows how to load the data, train an XGBoost model, and use that model to predict the "fare_amount" of a taxi trip.

A few libraries are required:
  1. NumPy
  2. cudf jar
  3. xgboost4j jar
  4. xgboost4j-spark jar


#### Import All Libraries

In [1]:
from ml.dmlc.xgboost4j.scala.spark import XGBoostRegressionModel, XGBoostRegressor
from ml.dmlc.xgboost4j.scala.spark.rapids import GpuDataReader
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, IntegerType, StructField, StructType
from time import time

Note on CPU version: `GpuDataReader` is not needed, but two extra imports are required.
```Python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
```

#### Create Spark Session

In [2]:
spark = SparkSession.builder.getOrCreate()

#### Specify the Data Schema and Load the Data

In [3]:
label = 'fare_amount'
schema = StructType([
    StructField('vendor_id', FloatType()),
    StructField('passenger_count', FloatType()),
    StructField('trip_distance', FloatType()),
    StructField('pickup_longitude', FloatType()),
    StructField('pickup_latitude', FloatType()),
    StructField('rate_code', FloatType()),
    StructField('store_and_fwd', FloatType()),
    StructField('dropoff_longitude', FloatType()),
    StructField('dropoff_latitude', FloatType()),
    StructField(label, FloatType()),
    StructField('hour', FloatType()),
    StructField('year', IntegerType()),
    StructField('month', IntegerType()),
    StructField('day', FloatType()),
    StructField('day_of_week', FloatType()),
    StructField('is_weekend', FloatType()),
])
features = [ x.name for x in schema if x.name != label ]

train_data = GpuDataReader(spark).schema(schema).option('header', True).csv('/data/datasets/taxi-small/train')
eval_data = GpuDataReader(spark).schema(schema).option('header', True).csv('/data/datasets/taxi-small/eval')

Note on CPU version: Create the data reader with `spark.read` instead of `GpuDataReader(spark)`. Vectorization is also required, which means all feature columns must be assembled into a single column.
```Python
def vectorize(data_frame):
    to_floats = [ col(x.name).cast(FloatType()) for x in data_frame.schema ]
    return (VectorAssembler()
        .setInputCols(features)
        .setOutputCol('features')
        .transform(data_frame.select(to_floats))
        .select(col('features'), col(label)))

train_data = spark.read.schema(schema).option('header', True).csv('/data/datasets/taxi-small/train')
eval_data = spark.read.schema(schema).option('header', True).csv('/data/datasets/taxi-small/eval')

train_data = vectorize(train_data)
eval_data = vectorize(eval_data)
```
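
To make the idea of assembling feature columns concrete, here is a minimal pure-Python sketch of what vectorization does to one row. It does not use Spark; the helper `assemble` and the shortened column list are assumptions made for illustration, and the row values are taken from the sample output shown later in this notebook.

```Python
label = 'fare_amount'
# Shortened, illustrative column list (the real schema has more columns).
columns = ['vendor_id', 'passenger_count', 'trip_distance', label]
features = [name for name in columns if name != label]

def assemble(row):
    """Pack the feature values of one row into a single list, keeping the label separate."""
    return {
        'features': [float(row[name]) for name in features],
        label: float(row[label]),
    }

row = {'vendor_id': 1.55973043E9, 'passenger_count': 2.0,
       'trip_distance': 0.7, 'fare_amount': 5.0}
assembled = assemble(row)
print(assembled)
```

`VectorAssembler` performs the same packing on a whole DataFrame, producing a vector-typed `features` column instead of a Python list.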

#### Create XGBoostRegressor

In [4]:
params = { 
    'eta': 0.05,
    'treeMethod': 'gpu_hist',
    'maxDepth': 8,
    'subsample': 0.8,
    'gamma': 1.0,
    'numRound': 100,
    'numWorkers': 1,
}
regressor = XGBoostRegressor(**params).setLabelCol(label).setFeaturesCols(features)

Note on CPU version: The CPU version provides only the `setFeaturesCol` function, which takes a single column; that is why vectorization is required. In the GPU version, the parameter `numWorkers` should be set to the number of machines with a GPU in the Spark cluster, while in the CPU version it can be set to the number of CPU cores. The tree method `gpu_hist` is designed for GPU training, while the tree method `hist` is designed for CPU training.
```Python
regressor = XGBoostRegressor(**params).setLabelCol(label).setFeaturesCol('features')
```
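
The differences between the two parameter sets can be sketched as a small helper. This is only an illustration under stated assumptions: the function `make_params` and its arguments are hypothetical, and the value of 8 CPU cores is an example, not a recommendation.

```Python
def make_params(use_gpu, num_gpu_machines=1, num_cpu_cores=1):
    """Build the parameter dict for either the GPU or the CPU version."""
    params = {
        'eta': 0.05,
        'maxDepth': 8,
        'subsample': 0.8,
        'gamma': 1.0,
        'numRound': 100,
    }
    if use_gpu:
        # GPU training: one worker per machine with a GPU.
        params['treeMethod'] = 'gpu_hist'
        params['numWorkers'] = num_gpu_machines
    else:
        # CPU training: workers can match the number of CPU cores.
        params['treeMethod'] = 'hist'
        params['numWorkers'] = num_cpu_cores
    return params

gpu_params = make_params(use_gpu=True, num_gpu_machines=1)
cpu_params = make_params(use_gpu=False, num_cpu_cores=8)  # 8 is an example value
print(gpu_params['treeMethod'], cpu_params['treeMethod'])
```

Either dict can then be passed to the regressor as `XGBoostRegressor(**params)`.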

#### Train the Model with Benchmark

In [5]:
def with_benchmark(phrase, action):
    start = time()
    result = action()
    end = time()
    print('{} takes {} seconds'.format(phrase, round(end - start, 2)))
    return result
model = with_benchmark('Training', lambda: regressor.fit(train_data))

Training takes 6.42 seconds
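
`time()` reads the wall clock, which can shift if the system clock is adjusted during a run. As an optional variation, the sketch below builds the same kind of helper on `time.perf_counter`, a monotonic clock intended for measuring intervals. The name `timed` and the summation workload are hypothetical and stand in for the training call.

```Python
from time import perf_counter

def timed(phrase, action):
    """Run action() and return its result together with the elapsed seconds."""
    start = perf_counter()
    result = action()
    elapsed = perf_counter() - start
    print('{} takes {} seconds'.format(phrase, round(elapsed, 2)))
    return result, elapsed

# A trivial stand-in workload; in the notebook this would be regressor.fit(train_data).
total, seconds = timed('Summation', lambda: sum(range(1000)))
```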


#### Save and Reload the Model

In [6]:
model.write().overwrite().save('/data/new-model-path')
loaded_model = XGBoostRegressionModel().load('/data/new-model-path')

#### Transformation and Show Result Sample

In [7]:
def transform():
    result = loaded_model.transform(eval_data).cache()
    result.foreachPartition(lambda _: None)
    return result
result = with_benchmark('Transformation', transform)
result.select('vendor_id', 'passenger_count', 'trip_distance', label, 'prediction').show(5)

Transformation takes 2.89 seconds
+------------+---------------+-------------+-----------+------------------+
|   vendor_id|passenger_count|trip_distance|fare_amount|        prediction|
+------------+---------------+-------------+-----------+------------------+
|1.55973043E9|            2.0|          0.7|        5.0|  4.87195348739624|
|1.55973043E9|            3.0|         10.7|       34.0| 32.97749328613281|
|1.55973043E9|            1.0|          2.3|       10.0|10.095549583435059|
|1.55973043E9|            1.0|          4.4|       16.5| 17.00450325012207|
|1.55973043E9|            1.0|          1.5|        7.0| 7.456318378448486|
+------------+---------------+-------------+-----------+------------------+
only showing top 5 rows



Note on CPU version: After vectorization the individual feature columns are no longer available, so you cannot `select` them. Use `result.show(5)` instead.

#### Evaluation

In [8]:
rmse = with_benchmark(
    'Evaluation',
    lambda: RegressionEvaluator().setLabelCol(label).evaluate(result))
print('RMSE is ' + str(rmse))

Evaluation takes 0.24 seconds
RMSE is 2.334654135967194
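
`RegressionEvaluator` reports the root-mean-square error (RMSE) by default, so lower values are better. As a minimal sketch of what that metric computes, the pure-Python function below (the name `rmse_of` is hypothetical) applies the formula to the five sample rows shown in the transformation output. Five rows are not representative of the whole evaluation set, so the value differs from the one reported above.

```Python
from math import sqrt

def rmse_of(labels, predictions):
    """Root-mean-square error: sqrt of the mean squared difference."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# fare_amount and prediction values copied from the five sample rows above.
sample_labels = [5.0, 34.0, 10.0, 16.5, 7.0]
sample_predictions = [4.87195348739624, 32.97749328613281, 10.095549583435059,
                      17.00450325012207, 7.456318378448486]
sample_rmse = rmse_of(sample_labels, sample_predictions)
print('RMSE on the five sample rows is ' + str(sample_rmse))
```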


#### Stop

In [9]:
spark.stop()