# Running XGBoost on Azure HDInsight
This notebook walks you through the detailed steps to build, install, and run XGBoost on HDInsight, the managed Hadoop and Spark solution on Azure.
![XGBoost](https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/logo-m/xgboost.png) 

### XGBoost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

It is not designed as a generic machine learning framework; it is a library specialized in boosted tree algorithms, and it is widely used in both production and experimental projects.

For more details on XGBoost, see the XGBoost [GitHub page](https://github.com/dmlc/xgboost).



### How to use this notebook
This notebook provides an end-to-end workflow: building the XGBoost jars, deploying them to Azure Storage, and running a boosted tree algorithm on HDInsight.

### Building XGBoost from source code
The following code snippet 

- installs the required libraries for building XGBoost
- builds XGBoost using Maven
- puts the compiled jars in the default storage account of the HDInsight cluster
- puts the sample data in the default storage account of the HDInsight cluster

The cell below uses the `%%sh` magic, which executes the code as a bash script on the head node.

You might see errors like the following when the XGBoost test suite runs during the build. This is expected, and the final test should pass.

    Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.0.0.15, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=4}
    17/08/14 22:41:34 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
    java.lang.RuntimeException: Worker exception.
            at ml.dmlc.xgboost4j.scala.spark.RabitTrackerRobustnessSuite$$anonfun$1$$anonfun$2.apply(RabitTrackerRobustnessSuite.scala:72)
            at ml.dmlc.xgboost4j.scala.spark.RabitTrackerRobustnessSuite$$anonfun$1$$anonfun$2.apply(RabitTrackerRobustnessSuite.scala:66)
            at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)


In [6]:
%%sh
sudo apt-get update
sudo apt-get install -y maven git build-essential cmake python-setuptools
git clone --recursive https://github.com/dmlc/xgboost

#builds XGBoost using Maven
cd xgboost/jvm-packages
mvn -DskipTests=true install

# put the compiled packages in shared storage
# (the root folder is used for simplicity)
hadoop fs -put -f xgboost4j-spark/target/xgboost4j-spark-0.7.jar /
hadoop fs -put -f xgboost4j/target/xgboost4j-0.7.jar /
hadoop fs -put -f xgboost4j-example/target/xgboost4j-example-0.7.jar /


#put the sample data to shared storage
hadoop fs -put -f ../demo/data/agaricus.txt* /

Process is terminated.


### Start a Spark session
After putting the jars and the data files in Azure Storage, which is shared across all the HDInsight nodes, the next step is to start a Spark session and call the XGBoost libraries.

In the `%%configure` cell below, we first load those jar files into the Spark session so that we can use the XGBoost APIs in this Jupyter notebook.

We also need to exclude a few Scala jars, because they cause conflicts between Livy (the REST API used on HDInsight to execute Spark code) and XGBoost.

In [1]:
%%configure -f
{ "jars": ["wasb:///xgboost4j-spark-0.7.jar", "wasb:///xgboost4j-0.7.jar", "wasb:///xgboost4j-example-0.7.jar"],
  "conf": {
    "spark.jars.excludes": "org.scala-lang:scala-reflect:2.11.8,org.scala-lang:scala-compiler:2.11.8,org.scala-lang:scala-library:2.11.8"
   }
}

### Import Packages
We then import the XGBoost packages. Running this first cell also starts the Spark application.

In [2]:
import ml.dmlc.xgboost4j.scala.Booster
import ml.dmlc.xgboost4j.scala.spark.XGBoost
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

Starting Spark application


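| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|----|---------------------|------|-------|----------|------------|------------------|
| 1 | application_1502756750987_0005 | spark | idle | Link | Link | ✔ |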


SparkSession available as 'spark'.
import org.apache.spark.SparkConf
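
Before training, it can help to confirm that the files uploaded by the `%%sh` cell are visible from the Spark session. The cell below is a minimal sketch and not part of the original walkthrough. It assumes the jar names and `wasb:///` paths used elsewhere in this notebook, and it relies on the standard Hadoop `FileSystem` API. The helper name `missingPaths` is hypothetical.

```scala
import org.apache.hadoop.fs.Path

// Paths written by the %%sh cell (assumes the 0.7 jar names used in this notebook).
val expectedPaths = Seq(
  "wasb:///xgboost4j-spark-0.7.jar",
  "wasb:///xgboost4j-0.7.jar",
  "wasb:///xgboost4j-example-0.7.jar",
  "wasb:///agaricus.txt.train",
  "wasb:///agaricus.txt.test")

// Hypothetical helper: returns the paths that do not exist in storage.
def missingPaths(paths: Seq[String]): Seq[String] = {
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  paths.filterNot { p =>
    val path = new Path(p)
    // Resolve the file system from the path's scheme, then test for existence.
    path.getFileSystem(hadoopConf).exists(path)
  }
}

val missing = missingPaths(expectedPaths)
if (missing.isEmpty) {
  println("All expected files are present.")
} else {
  println("Missing files: " + missing.mkString(", "))
}
```

If any path is reported missing, rerun the `%%sh` cell or upload the file manually before continuing.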

### Train a simple XGBoost model
We read the data from the default Blob storage account. The data is already there if you ran the `%%sh` cell above, but you can also get it from the [XGBoost repo](https://github.com/dmlc/xgboost/tree/master/demo/data).

In [17]:
// input data and model output locations
val inputTrainPath = "wasb:///agaricus.txt.train"
val inputTestPath = "wasb:///agaricus.txt.test"
val outputModelPath = "wasb:///XGBoostModelOutput"
val numWorkers = 4

// number of iterations
val numRound = 100

// build the training and test DataFrames from the LIBSVM files
val trainDF = spark.read.format("libsvm").load(inputTrainPath)
val testDF = spark.read.format("libsvm").load(inputTestPath)

// set the training parameters, then start training
val paramMap = List(
  "eta" -> 0.1f,
  "max_depth" -> 6,
  "objective" -> "binary:logistic").toMap

val xgboostModel = XGBoost.trainWithDataFrame(
  trainDF, paramMap, numRound, nWorkers = numWorkers, useExternalMemory = true)

xgboostModel: ml.dmlc.xgboost4j.scala.spark.XGBoostModel = XGBoostClassificationModel_07caca627526

### Predict on the test data
Transform the test data and look at the results.

In [19]:
// xgboost-spark appends the column containing prediction results
xgboostModel.transform(testDF).show()

+-----+--------------------+--------------------+----------+
|label|            features|       probabilities|prediction|
+-----+--------------------+--------------------+----------+
|  0.0|(126,[0,8,18,20,2...|[0.99930757284164...|       0.0|
|  1.0|(126,[2,8,18,20,2...|[0.00261396169662...|       1.0|
|  0.0|(126,[0,8,19,20,2...|[0.99930757284164...|       0.0|
|  0.0|(126,[2,8,18,20,2...|[0.99930757284164...|       0.0|
|  0.0|(126,[3,6,10,21,2...|[0.99719786643981...|       0.0|
|  0.0|(126,[2,9,19,20,2...|[0.99351829290390...|       0.0|
|  1.0|(126,[2,8,10,20,2...|[0.00236105918884...|       1.0|
|  0.0|(126,[0,8,19,20,2...|[0.99931138753890...|       0.0|
|  1.0|(126,[2,8,18,20,2...|[0.00265699625015...|       1.0|
|  0.0|(126,[3,8,19,20,2...|[0.99906325340271...|       0.0|
|  1.0|(126,[2,8,10,20,2...|[0.00236105918884...|       1.0|
|  0.0|(126,[0,8,19,20,2...|[0.99930530786514...|       0.0|
|  0.0|(126,[3,6,13,21,2...|[0.99980139732360...|       0.0|
|  0.0|(126,[0,8,18,20,2
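
The table above is easier to judge with a single number. The cell below is a minimal sketch, not part of the original notebook, that computes test accuracy with plain Spark DataFrame operations. It assumes `xgboostModel` and `testDF` from the earlier cells and the `label` and `prediction` column names shown in the output above.

```scala
import org.apache.spark.sql.functions.col

// Score the test set; the result includes the "label" and
// "prediction" columns shown in the output above.
val predictions = xgboostModel.transform(testDF)

// Fraction of rows where the predicted class matches the label.
val total = predictions.count()
val correct = predictions.filter(col("label") === col("prediction")).count()
val accuracy = if (total == 0) 0.0 else correct.toDouble / total

println(f"Test accuracy: $accuracy%.4f ($correct of $total rows)")
```

Accuracy is only one possible metric. For imbalanced data, a metric based on the predicted probabilities may be more informative.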

### Explain Parameters
Let's also look at the model's parameters, including their default and current values.

In [18]:
xgboostModel.explainParams()

res47: String =
alpha: L1 regularization term on weights, increase this value will make model more conservative. (default: 0.0, current: 0.0)
booster: Booster to use, options: {'gbtree', 'gblinear', 'dart'} (default: gbtree, current: gbtree)
colsample_bylevel: subsample ratio of columns for each split, in each level. (default: 1.0, current: 1.0)
colsample_bytree: subsample ratio of columns when constructing each tree. (default: 1.0, current: 1.0)
eta: step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features. and eta actually shrinks the feature weights to make the boosting process more conservative. (default: 0.3...

### Save the model to Azure Storage
XGBoost can save the model to Azure Storage. The `saveModelAsHadoopFile` API requires an implicit value of type `SparkContext`. The default `spark` value is a `SparkSession`, so we get the context from it and declare it as the implicit value `sc`.

In [20]:
// saveModelAsHadoopFile requires an implicit SparkContext; get it from the SparkSession
implicit val sc = spark.sparkContext
xgboostModel.saveModelAsHadoopFile(outputModelPath)
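
To check that the save worked, the model can be read back and used for scoring. The cell below is a sketch under one assumption that should be verified against your build: that the xgboost4j-spark 0.7 jar exposes `XGBoost.loadModelFromHadoopFile`, which takes the model path and the same implicit `SparkContext`. It reuses `sc`, `outputModelPath`, and `testDF` from the cells above.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoost

// Assumption: loadModelFromHadoopFile exists in the xgboost4j-spark 0.7 build
// used here; it picks up the implicit SparkContext `sc` defined in the cell above.
val loadedModel = XGBoost.loadModelFromHadoopFile(outputModelPath)

// Sanity check: the reloaded model should score the test set like the original.
loadedModel.transform(testDF).select("label", "prediction").show(5)
```

If the method is not available in your version, check the XGBoost4J-Spark documentation for the matching load API.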