# Introduction to XGBoost Spark with GPU

Agaricus is an example of an XGBoost classifier. In this notebook, we show how to load the data, train an XGBoost model, and use that model to predict whether a mushroom is "poisonous".

A few libraries are required:
  1. NumPy
  2. cudf jar
  3. xgboost4j jar
  4. xgboost4j-spark jar

#### Import All Libraries

In [1]:
from ml.dmlc.xgboost4j.scala.spark import XGBoostClassificationModel, XGBoostClassifier
from ml.dmlc.xgboost4j.scala.spark.rapids import GpuDataReader
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, StructField, StructType
from time import time

Note on CPU version: `GpuDataReader` is not necessary, but two extra imports are required.
```Python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
```

#### Create Spark Session

In [2]:
spark = SparkSession.builder.getOrCreate()

#### Specify the Data Schema and Load the Data

In [3]:
label = 'label'
features = [ 'feature_' + str(i) for i in range(0, 126) ]
schema = StructType([ StructField(x, FloatType()) for x in [label] + features ])

train_data = GpuDataReader(spark).schema(schema).option('header', True).csv('/data/datasets/agaricus-small/train')
eval_data = GpuDataReader(spark).schema(schema).option('header', True).csv('/data/datasets/agaricus-small/eval')

Note on CPU version: The data reader is created with `spark.read` instead of `GpuDataReader(spark)`. Vectorization is also required, which means all feature columns must be assembled into a single column.
```Python
def vectorize(data_frame):
    to_floats = [ col(x.name).cast(FloatType()) for x in data_frame.schema ]
    return (VectorAssembler()
        .setInputCols(features)
        .setOutputCol('features')
        .transform(data_frame.select(to_floats))
        .select(col('features'), col(label)))

train_data = spark.read.schema(schema).option('header', True).csv('/data/datasets/agaricus-small/train')
eval_data = spark.read.schema(schema).option('header', True).csv('/data/datasets/agaricus-small/eval')

train_data = vectorize(train_data)
eval_data = vectorize(eval_data)
```
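
To make the idea of vectorization concrete, the following plain-Python sketch mimics what assembling does to a single row: the individual feature values are gathered, in column order, into one list stored under `features`. The `assemble_row` helper and the three-feature row are made up for illustration; in the notebook the real work is done by `VectorAssembler` on Spark DataFrames.

```python
def assemble_row(row, feature_names, label_name):
    """Collect the feature values of one row into a single list."""
    return {
        'features': [float(row[name]) for name in feature_names],
        label_name: float(row[label_name]),
    }

# A made-up row with three features instead of the notebook's 126.
example_features = ['feature_0', 'feature_1', 'feature_2']
example_row = {'label': 1, 'feature_0': 0, 'feature_1': 1, 'feature_2': 0}

assembled = assemble_row(example_row, example_features, 'label')
print(assembled)
```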

#### Create XGBoostClassifier

In [4]:
params = { 
    'eta': 0.1,
    'missing': 0.0,
    'treeMethod': 'gpu_hist',
    'maxDepth': 2,
    'numWorkers': 1,
}
classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCols(features)

Note on CPU version: The CPU version provides the `setFeaturesCol` function, which takes a single column; that is why vectorization is required. In the GPU version, the parameter `numWorkers` should be set to the number of machines with GPUs in the Spark cluster, while in the CPU version it can be set to the number of CPU cores. The tree method `gpu_hist` is designed for GPU training, while the tree method `hist` is designed for CPU training.
```Python
classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCol('features')
```
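
The parameter changes described in this note can be sketched as a small dictionary update. This is only an illustration: `gpu_params` mirrors the `params` dictionary from the cell above, while `cpu_cores` and its value of 4 are placeholders for your own environment, not values taken from this notebook.

```python
# GPU parameters, copied from the cell above.
gpu_params = {
    'eta': 0.1,
    'missing': 0.0,
    'treeMethod': 'gpu_hist',
    'maxDepth': 2,
    'numWorkers': 1,
}

# Hypothetical core count; replace it with the number of CPU cores you have.
cpu_cores = 4

# Copy the GPU parameters and override only the two CPU-specific settings.
cpu_params = dict(gpu_params, treeMethod='hist', numWorkers=cpu_cores)

print(cpu_params)
```

Building the CPU dictionary from the GPU one keeps the shared settings (`eta`, `missing`, `maxDepth`) in a single place, so the two runs stay comparable.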

#### Train the Model with Benchmark

In [5]:
def with_benchmark(phrase, action):
    start = time()
    result = action()
    end = time()
    print('{} takes {} seconds'.format(phrase, round(end - start, 2)))
    return result
model = with_benchmark('Training', lambda: classifier.fit(train_data))

Training takes 6.64 seconds
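
`time()` reads the wall clock, which can jump if the system clock is adjusted during a run. A variant of the helper based on `time.perf_counter`, which is intended for measuring short durations, might look like the sketch below. The name `with_benchmark_monotonic` is illustrative, and this variant also returns the elapsed time so that it can be logged or compared.

```python
from time import perf_counter, sleep

def with_benchmark_monotonic(phrase, action):
    """Run action(), print how long it took, and return (result, seconds)."""
    start = perf_counter()
    result = action()
    elapsed = perf_counter() - start
    print('{} takes {} seconds'.format(phrase, round(elapsed, 2)))
    return result, elapsed

# Stand-in workload; in the notebook this would be classifier.fit(train_data).
value, seconds = with_benchmark_monotonic('Sleeping', lambda: sleep(0.05) or 'done')
```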


#### Save and Reload the Model

In [6]:
model.save('/data/new-model-path')
loaded_model = XGBoostClassificationModel().load('/data/new-model-path')

#### Transform the Data and Show a Result Sample

In [7]:
def transform():
    result = loaded_model.transform(eval_data).cache()
    result.foreachPartition(lambda _: None)
    return result
result = with_benchmark('Transformation', transform)
result.select(label, 'rawPrediction', 'probability', 'prediction').show(5)

Transformation takes 3.55 seconds
+-----+--------------------+--------------------+----------+
|label|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+----------+
|  1.0|[-0.5428417325019...|[0.45715826749801...|       1.0|
|  0.0|[-0.4572062194347...|[0.54279378056526...|       0.0|
|  0.0|[-0.4572062194347...|[0.54279378056526...|       0.0|
|  0.0|[-0.4514296054840...|[0.54857039451599...|       0.0|
|  0.0|[-0.5428417325019...|[0.45715826749801...|       1.0|
+-----+--------------------+--------------------+----------+
only showing top 5 rows
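
The `prediction` column can be read off the `probability` column. The sketch below assumes the usual binary layout, in which the probability vector is `[P(class 0), P(class 1)]` and the prediction is the index of the larger entry. The table truncates each vector, so the second entry is reconstructed here as one minus the first. The helper name is illustrative, and the input values are the first probability entries of the five sample rows, shortened to four decimals.

```python
def predict_from_probability(p_class_0):
    """Return the predicted class for a binary probability vector."""
    probability = [p_class_0, 1.0 - p_class_0]
    # Pick the index of the larger probability (in this sketch, ties go to class 0).
    return float(probability.index(max(probability)))

# First probability entries from the five sample rows.
sample_p0 = [0.4572, 0.5428, 0.5428, 0.5486, 0.4572]
predictions = [predict_from_probability(p) for p in sample_p0]
print(predictions)
```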



#### Evaluation

In [8]:
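# MulticlassClassificationEvaluator reports the F1 score by default;
# call .setMetricName('accuracy') on it to report accuracy instead.
accuracy = with_benchmark(
    'Evaluation',
    lambda: MulticlassClassificationEvaluator().setLabelCol(label).evaluate(result))
print('Accuracy is ' + str(accuracy))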

Evaluation takes 0.64 seconds
Accuracy is 0.9571535223397293
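
`MulticlassClassificationEvaluator` reports the weighted F1 score unless a different metric name is set, so it helps to see how F1 differs from plain accuracy. The plain-Python sketch below computes both for the five sample rows shown in the transformation output. The helper names are illustrative, and five rows are far too few to say anything about the model itself.

```python
def accuracy_score(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def weighted_f1_score(labels, predictions):
    """Per-class F1, averaged with weights proportional to class frequency."""
    total = 0.0
    for cls in set(labels):
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * labels.count(cls)
    return total / len(labels)

# Labels and predictions from the five sample rows.
sample_labels = [1.0, 0.0, 0.0, 0.0, 0.0]
sample_predictions = [1.0, 0.0, 0.0, 0.0, 1.0]

print(accuracy_score(sample_labels, sample_predictions))
print(round(weighted_f1_score(sample_labels, sample_predictions), 4))
```

On these five rows the two metrics already differ slightly, which is why the metric name should be set explicitly when accuracy is what you intend to report.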


#### Stop

In [9]:
spark.stop()