# Introduction to XGBoost Spark with GPU

Agaricus is an example of an XGBoost classifier. In this notebook, we show how to load data, train an XGBoost model, and use that model to predict whether a mushroom is "poisonous". Compared to the original XGBoost Spark code, there are only two API differences.

## Load libraries
First, we load the common libraries that both the GPU and CPU versions of XGBoost use:

In [1]:
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

The only import that is new to xgboost-spark users is `rapids.GpuDataReader`.

In [2]:
import ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataReader

Some libraries needed by the original CPU version are no longer needed in the GPU version. The CPU version has to import extra libraries like these:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.FloatType
```

## Set your dataset path

In [3]:
// Set the paths of datasets for training and prediction
// You need to update them to your real paths!
val trainPath = "/data/agaricus/csv/train/"
val trainWithEvalPath = "/data/agaricus/csv/trainWithEval/"
val evalPath  = "/data/agaricus/csv/eval/"

trainPath = /data/agaricus/csv/train/
trainWithEvalPath = /data/agaricus/csv/trainWithEval/
evalPath = /data/agaricus/csv/eval/


/data/agaricus/csv/eval/

## Set the schema of the dataset

For the agaricus example, the data has 126 dimensions, which we name "feature_0", "feature_1" ... "feature_125". A "label" column comes first, so the schema has 127 fields in total. The schema is used later to load the data.
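As a quick sanity check on the naming scheme, here is a minimal plain-Scala sketch that needs no Spark. The object name `FeatureNamingSketch` and its two helpers are hypothetical and exist only for illustration; they mirror the idea of the notebook's `featureNames` function and the later `featureCols` filter.

```scala
// Plain-Scala sketch (no Spark needed) of the column naming used in this notebook.
// FeatureNamingSketch, columnNames and featureColumns are illustrative names only.
object FeatureNamingSketch {
  val labelName = "label"

  // "label" first, then feature_0 .. feature_(length - 1)
  def columnNames(length: Int): List[String] =
    labelName :: (0 until length).map(i => s"feature_$i").toList

  // Everything except the label column is treated as a feature
  def featureColumns(length: Int): List[String] =
    columnNames(length).filter(_ != labelName)

  def main(args: Array[String]): Unit = {
    val cols = columnNames(126)
    println(cols.length)                 // 127: one label plus 126 features
    println(cols.head)                   // label
    println(cols.last)                   // feature_125
    println(featureColumns(126).length)  // 126
  }
}
```

Because `0 until length` excludes its upper bound, the last feature column for 126 dimensions is `feature_125`.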

In [4]:
val labelName = "label"
def featureNames(length: Int): List[String] =
  0.until(length).map(i => s"feature_$i").toList.+:(labelName)

def schema(length: Int): StructType =
  StructType(featureNames(length).map(n => StructField(n, DoubleType)))

val dataSchema = schema(126)

labelName = label
dataSchema = StructType(StructField(label,DoubleType,true), StructField(feature_0,DoubleType,true), StructField(feature_1,DoubleType,true), StructField(feature_2,DoubleType,true), StructField(feature_3,DoubleType,true), StructField(feature_4,DoubleType,true), StructField(feature_5,DoubleType,true), StructField(feature_6,DoubleType,true), StructField(feature_7,DoubleType,true), StructField(feature_8,DoubleType,true), StructField(feature_9,DoubleType,true), StructField(feature_10,DoubleType,true), StructField(feature_11,DoubleType,true), StructField(feature_12,DoubleType,true), StructField(feature_13,DoubleType,true), StructFie...


featureNames: (length: Int)List[String]
schema: (length: Int)org.apache.spark.sql.types.StructType


## Create a new spark session and load data

We must create a new Spark session before running any Spark operations. It is also used to initialize the `GpuDataReader`, a data reader powered by the GPU.

NOTE: in this notebook, the dependency jars were uploaded when the Toree kernel was installed. If they are not uploaded at installation time, they can also be uploaded from the notebook with the [%AddJar magic](https://toree.incubator.apache.org/docs/current/user/faq/). However, `%AddJar` has one restriction: an uploaded jar is only available if `%AddJar` is called after the Spark session has been created. It must be used as follows:

```scala
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("agaricus-GPU").getOrCreate
%AddJar file:/data/libs/cudf-0.9.1-cuda10.jar
%AddJar file:/data/libs/xgboost4j_2.11-1.0.0-Beta2.jar
%AddJar file:/data/libs/xgboost4j-spark_2.11-1.0.0-Beta2.jar
// ...
```

In [5]:
// build spark session
val spark = SparkSession.builder.appName("agaricus-gpu").getOrCreate


spark = org.apache.spark.sql.SparkSession@42fadb6c


org.apache.spark.sql.SparkSession@42fadb6c

Here is the first API difference: we now use `GpuDataReader` to load the dataset. Like the original Spark data loading API, `GpuDataReader` uses chained calls to `option`, `schema`, and `csv`. For the CPU version, the data reader is created like this:

```scala
val dataReader = spark.read
```

`featureCols` tells XGBoost which columns are `feature` columns and which column is the `label`.

In [6]:
// build data reader
val dataReader = new GpuDataReader(spark)
val featureCols = dataSchema.filter(_.name != labelName).map(_.name)

dataReader = ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataReader@21f1947a
featureCols = List(feature_0, feature_1, feature_2, feature_3, feature_4, feature_5, feature_6, feature_7, feature_8, feature_9, feature_10, feature_11, feature_12, feature_13, feature_14, feature_15, feature_16, feature_17, feature_18, feature_19, feature_20, feature_21, feature_22, feature_23, feature_24, feature_25, feature_26, feature_27, feature_28, feature_29, feature_30, feature_31, feature_32, feature_33, feature_34, feature_35, feature_36, feature_37, feature_38, feature_39, feature_40, feature_41, feature_42, feature_43, feature_44, feature_45, feature_46, feature_47, feature_48, feature_49, feature_50, feature_51, feature_52, feature_53, fe...


Now we can use `dataReader` to read data directly. In the CPU version, however, we have to use `VectorAssembler` to assemble all feature columns into one column. The reason is explained later. The CPU version code looks like this:

```scala
object Vectorize {
  def apply(df: DataFrame, featureNames: Seq[String], labelName: String): DataFrame = {
    val toFloat = df.schema.map(f => col(f.name).cast(FloatType))
    new VectorAssembler()
      .setInputCols(featureNames.toArray)
      .setOutputCol("features")
      .transform(df.select(toFloat:_*))
      .select(col("features"), col(labelName))
  }
}

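// `dataReader` is the CPU reader here (spark.read).
// `var` is needed because both datasets are reassigned after vectorization.
var trainSet = dataReader.csv(trainPath)
var evalSet = dataReader.csv(evalPath)
trainSet = Vectorize(trainSet, featureCols, labelName)
evalSet = Vectorize(evalSet, featureCols, labelName)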
```

With `GpuDataReader`, `VectorAssembler` is no longer needed. We can simply read the data with:

In [7]:
// load data of training and evaluation
var (trainSet, trainWithEvalSet, evalSet) = {
  dataReader.option("header", true).schema(dataSchema)
  (dataReader.csv(trainPath), dataReader.csv(trainWithEvalPath),dataReader.csv(evalPath))}

trainSet = ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset@c35dcf0
trainWithEvalSet = ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset@468d0483
evalSet = ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset@6e9204c7


ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset@6e9204c7

## Set xgboost parameters and initialize XGBoostClassifier

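The differences here are in `num_workers` and `tree_method`. In the GPU version, `num_workers` should be set to the number of machines with a GPU in the Spark cluster and `tree_method` is `gpu_hist`. In the CPU version, `num_workers` can be set to the number of CPU cores and `tree_method` is `hist`: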

```scala
// difference in parameters
  "num_workers" -> 12,
  "tree_method" -> "hist",
```
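To keep the CPU and GPU settings side by side, the two parameter maps can be derived from one shared base. The following is a minimal plain-Scala sketch. `ParamSketch` and its `params` helper are hypothetical names, not part of XGBoost4J-Spark. The shared values are the ones used in this notebook.

```scala
// Hypothetical helper for building CPU and GPU parameter maps from a shared base.
object ParamSketch {
  // Settings shared by both versions (values taken from this notebook)
  val common: Map[String, Any] = Map(
    "eta" -> 0.1,
    "max_depth" -> 2,
    "missing" -> 0.0,
    "num_round" -> 100
  )

  // Only num_workers and tree_method differ between the two versions
  def params(useGpu: Boolean, numWorkers: Int): Map[String, Any] =
    common ++ Map(
      "num_workers" -> numWorkers,
      "tree_method" -> (if (useGpu) "gpu_hist" else "hist")
    )

  def main(args: Array[String]): Unit = {
    println(params(useGpu = true, numWorkers = 1)("tree_method"))   // gpu_hist
    println(params(useGpu = false, numWorkers = 12)("tree_method")) // hist
  }
}
```

The resulting `Map[String, Any]` has the same shape as the `paramMap` that is passed to `XGBoostClassifier` in this notebook.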

In [8]:
// build XGBoost classifier
val paramMap = Map(
  "eta" -> 0.1,
  "max_depth" -> 2,
  "num_workers" -> 1,
  "tree_method" -> "gpu_hist",
  "missing" -> 0.0,
  "num_round" -> 100
)

paramMap = Map(num_workers -> 1, max_depth -> 2, num_round -> 100, missing -> 0.0, tree_method -> gpu_hist, eta -> 0.1)


Map(num_workers -> 1, max_depth -> 2, num_round -> 100, missing -> 0.0, tree_method -> gpu_hist, eta -> 0.1)

The second API difference is `setFeaturesCol` in the CPU version versus `setFeaturesCols` in the GPU version. Earlier we said that the CPU version needs `VectorAssembler` to assemble all feature columns. The reason is that `setFeaturesCol` accepts a single String naming the vectorized `feature` column, so `VectorAssembler` is required to combine all feature columns into one. `setFeaturesCols`, by contrast, accepts a list of strings, so `VectorAssembler` is no longer needed.

CPU version:

```scala
val xgbClassifier  = new XGBoostClassifier(paramMap)
  .setLabelCol(labelName)
  .setFeaturesCol("features")
```

In [9]:
val xgbClassifier  = new XGBoostClassifier(paramMap)
  .setLabelCol(labelName)
  // === diff ===
  .setFeaturesCols(featureCols)

xgbClassifier = xgbc_00511571b736


xgbc_00511571b736

## Benchmark and train
The `Benchmark` object measures elapsed time, so that training time can be compared with the CPU version of XGBoost.

Training with evaluation sets is also supported, in the same two ways as in the CPU version:

* API `setEvalSets` after initializing an XGBoostClassifier

```scala
xgbClassifier.setEvalSets(Map("eval" -> evalSet))
```

* parameter `eval_sets` when initializing an XGBoostClassifier

```scala
val paramMapWithEval = paramMap + ("eval_sets" -> Map("eval" -> evalSet))
val xgbClassifierWithEval = new XGBoostClassifier(paramMapWithEval)
```

In this notebook, we use the API method to set the evaluation sets.

In [10]:
xgbClassifier.setEvalSets(Map("eval" -> trainWithEvalSet))

xgbc_00511571b736

In [11]:
object Benchmark {
  def time[R](phase: String)(block: => R): (R, Float) = {
    val t0 = System.currentTimeMillis
    val result = block // call-by-name
    val t1 = System.currentTimeMillis
    println("Elapsed time [" + phase + "]: " + ((t1 - t0).toFloat / 1000) + "s")
    (result, (t1 - t0).toFloat / 1000)
  }
}

// start training
println("\n------ Training ------")
val (xgbClassificationModel, _) = Benchmark.time("train") {
  xgbClassifier.fit(trainSet)
}


------ Training ------
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.124, DMLC_TRACKER_PORT=9092, DMLC_NUM_WORKER=1}
Elapsed time [train]: 8.776s


defined object Benchmark
xgbClassificationModel = xgbc_00511571b736


xgbc_00511571b736
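To compare a GPU run against a CPU run, the elapsed seconds returned as the second tuple element can be captured instead of being discarded with `_`. The sketch below is plain Scala and assumes nothing about Spark. `BenchmarkSketch` and its `speedup` helper are hypothetical, the workload is a stand-in for `xgbClassifier.fit(trainSet)`, and the timings passed to `speedup` are illustrative numbers, not measurements.

```scala
// Plain-Scala sketch of capturing elapsed time for a later CPU vs GPU comparison.
object BenchmarkSketch {
  // Same shape as the notebook's Benchmark.time, without the phase label and println
  def time[R](block: => R): (R, Float) = {
    val t0 = System.currentTimeMillis
    val result = block // call-by-name: the block runs here
    val t1 = System.currentTimeMillis
    (result, (t1 - t0).toFloat / 1000)
  }

  // Hypothetical helper: how many times faster the GPU run was
  def speedup(cpuSeconds: Float, gpuSeconds: Float): Float = {
    require(gpuSeconds > 0, "gpuSeconds must be positive")
    cpuSeconds / gpuSeconds
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the notebook this would be xgbClassifier.fit(trainSet)
    val (sum, seconds) = time { (1 to 1000000).map(_.toLong).sum }
    println(s"result = $sum, elapsed = ${seconds}s")
    // Illustrative numbers only, not measurements
    println(speedup(20.0f, 8.0f)) // 2.5
  }
}
```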

## Transformation and evaluation
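We use `evalSet` to evaluate the model and show a few key columns of the predictions. Finally, we use `MulticlassClassificationEvaluator` to compute an overall score for the predictions. Note that this evaluator's default metric is `f1`; call `setMetricName("accuracy")` on it if plain accuracy is wanted.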

In [12]:
// start transform
println("\n------ Transforming ------")
val (results, _) = Benchmark.time("transform") {
  val ret = xgbClassificationModel.transform(evalSet).cache()
  ret.foreachPartition(_ => ())
  ret
}
results.select(labelName, "rawPrediction", "probability", "prediction").show(10)

println("\n------Accuracy of Evaluation------")
val evaluator = new MulticlassClassificationEvaluator()
evaluator.setLabelCol(labelName)
val accuracy = evaluator.evaluate(results)

println(s"accuracy == $accuracy")


------ Transforming ------
Elapsed time [transform]: 2.593s
+-----+--------------------+--------------------+----------+
|label|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+----------+
|  1.0|[-0.9667758941650...|[0.03322410583496...|       1.0|
|  0.0|[-0.0080436170101...|[0.99195638298988...|       0.0|
|  0.0|[-0.0080436170101...|[0.99195638298988...|       0.0|
|  0.0|[-0.1416745483875...|[0.85832545161247...|       0.0|
|  0.0|[-0.0747678875923...|[0.92523211240768...|       0.0|
|  1.0|[-0.9667758941650...|[0.03322410583496...|       1.0|
|  0.0|[-0.0145334601402...|[0.98546653985977...|       0.0|
|  1.0|[-0.9667758941650...|[0.03322410583496...|       1.0|
|  0.0|[-0.0457237958908...|[0.95427620410919...|       0.0|
|  1.0|[-0.9667758941650...|[0.03322410583496...|       1.0|
+-----+--------------------+--------------------+----------+
only showing top 10 rows


------Accuracy of Evaluation------
accuracy == 0.998757706

results = [label: float, feature_0: float ... 128 more fields]
evaluator = mcEval_d5e223cecfdc
accuracy = 0.9987577063864658


0.9987577063864658
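Plain accuracy is simply the fraction of rows whose `prediction` equals the `label`. The following minimal plain-Scala sketch makes that definition concrete. `AccuracySketch` is a hypothetical name, and the rows are made up for illustration rather than taken from the agaricus data.

```scala
// Plain-Scala sketch of the accuracy definition on (label, prediction) pairs.
object AccuracySketch {
  // Fraction of (label, prediction) pairs that agree
  def accuracy(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one row")
    pairs.count { case (label, prediction) => label == prediction }.toDouble / pairs.size
  }

  def main(args: Array[String]): Unit = {
    // Made-up rows for illustration; not taken from the agaricus data
    val rows = Seq((1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0))
    println(accuracy(rows)) // 0.75
  }
}
```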

## Save the model to disk and load model
We save the model to disk and then load it to memory. We can use the loaded model to do a new prediction.

In [13]:
xgbClassificationModel.write.overwrite.save("/data/model/agaricus")

val modelFromDisk = XGBoostClassificationModel.load("/data/model/agaricus")
val (results2, _) = Benchmark.time("transform2") {
  modelFromDisk.transform(evalSet)
}
results2.select(labelName, "rawPrediction", "probability", "prediction").show(10)

Elapsed time [transform2]: 0.053s
+-----+--------------------+--------------------+----------+
|label|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+----------+
|  1.0|[-0.9667758941650...|[0.03322410583496...|       1.0|
|  0.0|[-0.0080436170101...|[0.99195638298988...|       0.0|
|  0.0|[-0.0080436170101...|[0.99195638298988...|       0.0|
|  0.0|[-0.1416745483875...|[0.85832545161247...|       0.0|
|  0.0|[-0.0747678875923...|[0.92523211240768...|       0.0|
|  1.0|[-0.9667758941650...|[0.03322410583496...|       1.0|
|  0.0|[-0.0145334601402...|[0.98546653985977...|       0.0|
|  1.0|[-0.9667758941650...|[0.03322410583496...|       1.0|
|  0.0|[-0.0457237958908...|[0.95427620410919...|       0.0|
|  1.0|[-0.9667758941650...|[0.03322410583496...|       1.0|
+-----+--------------------+--------------------+----------+
only showing top 10 rows



modelFromDisk = xgbc_00511571b736
results2 = [label: float, feature_0: float ... 128 more fields]


[label: float, feature_0: float ... 128 more fields]

In [14]:
spark.close()