# Running Spark with the MLlib Machine Learning Library

Example from:
https://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts

The Spark session was already running when we launched the Spark job.

Let's inspect the SparkContext.

In [1]:
sc.setLogLevel("ERROR")
sc

---

Let's make sure Spark can see all the cores.

We have two 36-core compute nodes, for 72 cores in total.

In [2]:
sc.defaultParallelism

72

---

Let's load the LIBSVM-format text file into a Spark RDD of labeled points (`loadLibSVMFile` returns an RDD, not a DataFrame).

In [7]:
from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
data

PythonRDD[9] at RDD at PythonRDD.scala:53
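
Each line of a LIBSVM file has the form `label index1:value1 index2:value2 ...`, with 1-based feature indices, and `MLUtils.loadLibSVMFile` turns each line into a labeled point with a sparse feature vector. The sketch below illustrates that format in plain Python. It is not Spark's parser: `parse_libsvm_line` is a hypothetical helper, and the sample line is made up rather than taken from `sample_libsvm_data.txt`.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM line into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # LIBSVM indices are 1-based; shift them to 0-based positions.
        features[int(index) - 1] = float(value)
    return label, features


# Made-up sample line, for illustration only.
label, features = parse_libsvm_line("1 3:0.5 10:2.0 128:51")
print(label, features)
```

Only the non-zero entries are stored, which is why the format suits sparse data such as flattened images.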

---

Let's split the data into training and test sets with the RDD's `randomSplit` method, then train and evaluate a gradient-boosted trees classifier with MLlib.

In [8]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
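
`randomSplit` assigns each element to a split at random according to the weights, which are normalized if they do not sum to 1. The split sizes are therefore only approximately 70/30 and change from run to run unless a seed is passed (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea under those assumptions. `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random


def random_split(items, weights, seed=None):
    """Assign each item to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(item)
                break
    return splits


train, test = random_split(range(100), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split, and any error computed from it, reproducible across runs.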

In [9]:
# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainClassifier(trainingData,
                                             categoricalFeaturesInfo={}, numIterations=3)

In [10]:
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))

Test Error = 0.037037037037037035
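
The test error is the fraction of test examples whose predicted label differs from the true label. The sketch below repeats the same calculation on a plain Python list. The pairs are made up for illustration (26 correct predictions and 1 wrong one) and `error_rate` is a hypothetical helper. The source does not state the actual counts behind the value printed above.

```python
def error_rate(labels_and_predictions):
    """Return the fraction of (label, prediction) pairs that disagree."""
    pairs = list(labels_and_predictions)
    wrong = sum(1 for label, prediction in pairs if label != prediction)
    return wrong / float(len(pairs))


# Made-up data: 26 correct predictions and 1 misclassification.
pairs = [(1.0, 1.0)] * 14 + [(0.0, 0.0)] * 12 + [(1.0, 0.0)]
print("Test Error = " + str(error_rate(pairs)))
```

With a test set this small, a single misclassified example shifts the error by several percentage points, so the figure should be read as a rough indication only.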


In [11]:
# Save and load model
model.save(sc, "target/myGradientBoostingClassificationModel")
sameModel = GradientBoostedTreesModel.load(sc,
                                           "target/myGradientBoostingClassificationModel")
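
Re-running the save cell will typically fail, because Spark refuses to write to an output path that already exists. The sketch below shows one way to clear a stale model directory first, assuming the path resolves to a local or shared POSIX filesystem (the source does not say, and this would not apply to HDFS). `remove_local_dir` is a hypothetical helper, demonstrated here on a temporary directory rather than on the real `target/` path.

```python
import os
import shutil
import tempfile


def remove_local_dir(path):
    """Delete a local directory tree if it exists; return True if it was removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


# Demonstration on a throwaway temporary directory.
demo_root = tempfile.mkdtemp()
demo_path = os.path.join(demo_root, "myGradientBoostingClassificationModel")
os.makedirs(os.path.join(demo_path, "metadata"))
first = remove_local_dir(demo_path)   # directory existed
second = remove_local_dir(demo_path)  # nothing left to remove
print(first, second)
shutil.rmtree(demo_root)
```

In the notebook, the helper would be called with the model path just before `model.save`.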
