By the end of this activity, you will be able to perform the following in Spark:

Determine the accuracy of a classifier model
Display the confusion matrix for a classifier model

In this activity, you will be programming in a Jupyter Python Notebook. If you have not already started the Jupyter Notebook server, see the reading Instructions for Starting Jupyter.

Step 1. Open Jupyter Python Notebook. Open a web browser by clicking on the web browser icon at the top of the toolbar:
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/RCneZE7PEeaqTxIkdCEfsw_c491f272226b35805e44abef7a7a22a9_browser-icon.png?expiry=1507852800000&hmac=Tm1M4Bmovuy3gsD3GcbtEhO0sFTqQnTc_CuVCnskA-w)

Navigate to localhost:8889/tree/Downloads/big-data-4:

![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/9Zu58oqhEeaKKwpaECzIKQ_361b99533aaa8d7cde3e3df56b69b3f5_browser.png?expiry=1507852800000&hmac=xdwPQiACz7sjm7XYxYhIJHxEGIQ13eig6Gzhc1rtnw4)
Open the model evaluation notebook by clicking on model-evaluation.ipynb:
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/BfjlhovoEea63BLi-G7oTw_f95938970c4788aac115fe4c8866d0c6_notebook.png?expiry=1507852800000&hmac=suhudiH4e-Vj-aN_Cz4pTfCYC1zwLzmYvHnwCvgKZU4)

Step 2. Load predictions. Execute the first cell to load the classes used in this activity:

In [1]:
from pyspark.sql import SQLContext
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

Execute the next cell to load the predictions CSV file that we created at the end of the Week 3 Hands-On Classification in Spark into a DataFrame:

In [2]:
sqlContext = SQLContext(sc)
predictions = sqlContext.read.load('file:///home/cloudera/Downloads/big-data-4/prediction.csv', 
                          format='com.databricks.spark.csv', 
                          header='true',inferSchema='true')

Step 3. Compute accuracy. Let's create an instance of MulticlassClassificationEvaluator to determine the accuracy of the predictions:

In [3]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="label",predictionCol="prediction",metricName="precision")

In [4]:
evaluator

MulticlassClassificationEvaluator_4d7f8e0d2dd97cb2d5fa

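The first two arguments specify the names of the label and prediction columns, and the third argument specifies that we want the overall precision. For a multiclass evaluator, overall precision is the fraction of predictions that match the label, which is the same quantity as accuracy. (Spark 2.0 and later name this metric "accuracy" and no longer accept "precision".)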

We can compute the accuracy by calling evaluate():

In [12]:
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %.2g" % accuracy)

Accuracy = 0.81
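
To make the metric concrete: the value computed here is the fraction of rows whose prediction equals the label. The following is a minimal plain-Python sketch of that calculation. It uses a small hypothetical list of (prediction, label) pairs, not the actual contents of prediction.csv, and it is not how Spark computes the metric internally.

```python
# Hypothetical (prediction, label) pairs; not taken from prediction.csv.
pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Count the rows where the predicted class matches the true label.
correct = sum(1 for prediction, label in pairs if prediction == label)

# Accuracy is the share of correct predictions among all rows.
toy_accuracy = correct / float(len(pairs))
print("Accuracy = %.2g" % toy_accuracy)  # prints "Accuracy = 0.6"
```

Three of the five toy pairs agree, so the sketch reports 0.6.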


Step 4. Display confusion matrix. The MulticlassMetrics class can be used to generate a confusion matrix of our classifier model. However, unlike MulticlassClassificationEvaluator, MulticlassMetrics works with RDDs of numbers and not DataFrames, so we need to convert our predictions DataFrame into an RDD.

If we use the rdd attribute of predictions, we see this is an RDD of Rows:



In [5]:
predictions.rdd.take(2)

[Row(prediction=1.0, label=1.0), Row(prediction=1.0, label=1.0)]

Instead, we can apply tuple to each Row with map() to get an RDD of (prediction, label) pairs of numbers:



In [6]:
predictions.rdd.map(tuple).take(2)

[(1.0, 1.0), (1.0, 1.0)]
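
map(tuple) works because each Row behaves like a tuple of its field values, so calling tuple() on it drops the field names and keeps the values in column order. The sketch below illustrates the idea without Spark. It uses collections.namedtuple as a stand-in for Row, which is an assumption made only so the example runs on its own, and the two rows are hypothetical.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row with the same two fields (illustration only).
Row = namedtuple("Row", ["prediction", "label"])

# Hypothetical rows; the real ones come from predictions.rdd.
rows = [Row(prediction=1.0, label=1.0), Row(prediction=1.0, label=0.0)]

# Equivalent of rdd.map(tuple): convert every record to a plain tuple.
pairs = list(map(tuple, rows))
print(pairs)  # prints [(1.0, 1.0), (1.0, 0.0)]
```

Each resulting tuple keeps the column order of the DataFrame, here (prediction, label).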

Let's create an instance of MulticlassMetrics with this RDD:



In [7]:
metrics = MulticlassMetrics(predictions.rdd.map(tuple))

NOTE: the above command can take longer to execute than most Spark commands when first run in the notebook.

The confusionMatrix() function returns a Spark Matrix, which we can convert to a NumPy array and transpose to view:

In [8]:
metrics.confusionMatrix().toArray().transpose()

array([[ 87.,  26.],
       [ 14.,  83.]])

**Q**

In the last line of code in Step 4, the confusion matrix is printed out. If transpose() is removed, the confusion matrix will be displayed as:


In [13]:
metrics.confusionMatrix().toArray()


array([[ 87.,  14.],
       [ 26.,  83.]])
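
The confusion matrix is consistent with the accuracy from Step 3: the diagonal holds the correctly classified counts, so dividing the diagonal sum by the total count gives the accuracy. Here is a short plain-Python check using the counts shown above, with nested lists standing in for the NumPy array:

```python
# Counts copied from the transposed confusion matrix shown in Step 4.
matrix = [[87.0, 26.0],
          [14.0, 83.0]]

# Diagonal entries are the correctly classified examples.
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
print("Accuracy = %.2g" % (correct / total))  # prints "Accuracy = 0.81"

# Transposing swaps the off-diagonal counts but leaves the diagonal unchanged.
untransposed = [list(column) for column in zip(*matrix)]
print(untransposed)  # prints [[87.0, 14.0], [26.0, 83.0]]
```

Because transposing leaves the diagonal and the total unchanged, both orientations of the matrix give the same accuracy of 170/210, or about 0.81.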