# SVM with PySpark

This notebook trains and evaluates a [Linear SVM](https://spark.apache.org/docs/2.2.0/mllib-linear-methods.html#linear-support-vector-machines-svms) classifier with PySpark.

* Method: LinearSVM with [SGD](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
* Dataset: Spark MLlib Sample SVM Data
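
To make the method concrete, here is a minimal pure-Python sketch of SGD on the L2-regularized hinge loss, the objective a linear SVM minimizes. It is not MLlib's implementation: the function names, the toy data, and the one-point-per-step update are assumptions made for illustration.

```python
import random

def hinge_sgd(points, iterations=100, step=1.0, reg_param=0.01, seed=42):
    """Fit weights w and intercept b by SGD on the L2-regularized hinge loss.

    `points` is a list of (label, features) pairs with labels in {0, 1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0][1])
    b = 0.0
    for t in range(1, iterations + 1):
        label, x = rng.choice(points)      # one random point per step
        y = 2.0 * label - 1.0              # map {0, 1} to {-1, +1}
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        lr = step / t ** 0.5               # decaying step size
        # The L2 penalty shrinks the weights on every step
        w = [wi * (1.0 - lr * reg_param) for wi in w]
        if margin < 1.0:
            # Hinge loss is active: push this point toward the correct side
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y
    return w, b

def predict(w, b, x):
    """Return 1.0 if the point falls on the positive side of the hyperplane."""
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0.0

# Toy, linearly separable data (made up for this sketch)
toy = [(1, [2.0, 1.0]), (1, [1.5, 2.0]), (0, [-1.0, -1.5]), (0, [-2.0, -0.5])]
w, b = hinge_sgd(toy)
toy_predictions = [predict(w, b, x) for _, x in toy]
print(toy_predictions)
```

The update only moves the weights when a point sits inside the margin; otherwise only the regularization term acts.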

## Imports

In [None]:
import findspark
findspark.init()

import numpy as np

from pyspark import SparkContext
from pyspark.sql import SQLContext

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import BinaryClassificationMetrics

import matplotlib.pyplot as plt
%matplotlib inline

## Get Some Context

In [None]:
# Create the SparkContext and SQLContext used throughout the notebook
sc = SparkContext(appName="SVM Classification with Spark")
sqlContext = SQLContext(sc)

## Load and Prepare the Data

In [None]:
DATA_FILE = "/Users/robert.dempsey/Dev/daamlobd/data/mllib/sample_svm_data.txt"

In [None]:
def parse_point(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
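
As a quick check of the parsing logic, the sketch below applies the same split-and-convert step to a single line without Spark. The helper name `parse_fields` and the sample line are made up for illustration; they are not taken from the dataset.

```python
def parse_fields(line):
    """Split a space-separated line into (label, features) as plain floats."""
    values = [float(x) for x in line.split(' ')]
    # The first value is the class label, the rest are the feature values
    return values[0], values[1:]

sample_line = "1 0 2.5 0 1.25"   # hypothetical line, not from the dataset
label, features = parse_fields(sample_line)
print(label, features)
```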

In [None]:
# Load the training data
raw_data = sc.textFile(DATA_FILE)
data = raw_data.map(parse_point)

In [None]:
# View one of the records
data.take(1)

In [None]:
# Create train and test datasets
train, test = data.randomSplit([0.8, 0.2], seed=42)
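
A weighted random split assigns each record to a split independently, so the resulting sizes are only approximately 80/20. The pure-Python sketch below illustrates that idea; the function `random_split` is hypothetical and is not how Spark implements `randomSplit` internally.

```python
import random

def random_split(items, weights, seed):
    """Assign each item to one split by drawing a uniform number per item."""
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    acc = 0.0
    for weight in weights:
        acc += weight / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder
            if u < bound or i == len(bounds) - 1:
                splits[i].append(item)
                break
    return splits

toy_train, toy_test = random_split(list(range(1000)), [0.8, 0.2], seed=42)
print(len(toy_train), len(toy_test))
```

Fixing the seed makes the split repeatable, which keeps the evaluation numbers comparable between runs.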

## Fit an SVM Model

In [None]:
# Train the SVM model
model = SVMWithSGD.train(train,
                         iterations=100)

In [None]:
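# Show the learned weights and intercept
# (the intercept stays 0.0 unless intercept=True is passed to train)
print("Weights: {}".format(model.weights))
print("Intercept: {}".format(model.intercept))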
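
A linear SVM turns its weights and intercept into a prediction by computing the margin `w . x + b` and comparing it with a threshold. MLlib's `SVMModel` uses a threshold of 0.0 by default and returns the raw margin instead once `clearThreshold()` has been called. The sketch below shows the computation in plain Python; the weights and feature vectors are hypothetical values, not output from the trained model.

```python
def margin(weights, intercept, features):
    """Signed score of a point: dot(weights, features) + intercept."""
    return sum(w * x for w, x in zip(weights, features)) + intercept

def classify(weights, intercept, features, threshold=0.0):
    """Predict 1.0 when the margin exceeds the threshold, otherwise 0.0."""
    return 1.0 if margin(weights, intercept, features) > threshold else 0.0

# Hypothetical model parameters and points, chosen for illustration
toy_weights = [0.5, -1.0]
toy_intercept = 0.0

positive_point = [2.0, 0.5]   # margin = 1.0 - 0.5 = 0.5
negative_point = [1.0, 1.0]   # margin = 0.5 - 1.0 = -0.5

print(classify(toy_weights, toy_intercept, positive_point))
print(classify(toy_weights, toy_intercept, negative_point))
```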

## Create Predictions

In [None]:
# Pair each test label with the model's prediction, cast to float to match the label's type
labels_and_predictions = test.map(lambda p: (p.label, float(model.predict(p.features))))

In [None]:
# Convert labels_and_predictions and the test RDD to dataframes
lp_df = sqlContext.createDataFrame(labels_and_predictions, ["label", "predicted"])
test_df = sqlContext.createDataFrame(test, ["features", "label"])

# Make sure they have the same number of records
print(lp_df.count(), test_df.count())

In [None]:
# Check the dataframes
# show() prints the table itself and returns None, so it is not wrapped in print()
lp_df.show(5)
test_df.show(5)

In [None]:
# Create a plot to compare the actuals (labels) and predictions
actuals = lp_df.rdd.map(lambda r: r.label).collect()
predictions = lp_df.rdd.map(lambda r: float(r.predicted)).collect()


fig = plt.figure(figsize=(10,5))
plt.scatter(actuals, predictions)
plt.xlabel("Actuals")
plt.ylabel("Predictions")
plt.title("Actuals vs. Predictions")
plt.show()

## Model Evaluation

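### Test Error

Calculate the misclassification rate on the held-out test set. The predictions were made on `test`, so this is a test error rather than a training error.

In [None]:
test_error = labels_and_predictions.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
print("Test Error = %.2f" % test_error)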
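
The misclassification rate is the share of (label, prediction) pairs that disagree. A minimal sketch on a handful of made-up pairs, without Spark:

```python
# Hypothetical (label, prediction) pairs, not results from the model
pairs = [(1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)]

# Count the pairs where the prediction differs from the label
errors = sum(1 for label, predicted in pairs if label != predicted)
error_rate = errors / float(len(pairs))

print("Error rate = %.2f" % error_rate)
```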

In [None]:
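# BinaryClassificationMetrics expects (score, label) pairs, so swap the order
metrics = BinaryClassificationMetrics(labels_and_predictions.map(lambda lp: (lp[1], lp[0])))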

### Area Under ROC

The area under the ROC curve (AUC) measures how well the classifier separates the two classes: 1.0 is perfect separation and 0.5 is no better than chance. A common rule of thumb grades it as follows:

* .90-1 = excellent (A)
* .80-.90 = good (B)
* .70-.80 = fair (C)
* .60-.70 = poor (D)
* .50-.60 = fail (F)
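
One way to read the area under the ROC curve is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it by comparing every positive/negative pair; the function name and the scores are made up for illustration, and this is not how `BinaryClassificationMetrics` computes the value.

```python
def area_under_roc(score_and_labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in score_and_labels if y == 1.0]
    negatives = [s for s, y in score_and_labels if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0     # positive ranked above negative
            elif p == n:
                wins += 0.5     # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical (score, label) pairs
toy_scores = [(0.9, 1.0), (0.8, 1.0), (0.7, 0.0),
              (0.6, 1.0), (0.3, 0.0), (0.2, 0.0)]
toy_auc = area_under_roc(toy_scores)
print("Area Under ROC = %.2f" % toy_auc)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the area is 8/9.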

In [None]:
print("Area Under ROC = %.2f" % metrics.areaUnderROC)

### Precision-Recall Curve

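The precision-recall curve shows the tradeoff between precision and recall as the decision threshold varies. The area under it summarizes the curve in a single number.

* Higher area = high recall (few false negatives) and high precision (few false positives among the predicted positives)
* Lower area = low recall (many false negatives) and low precision (many false positives among the predicted positives)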
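
Precision and recall at a single threshold follow directly from the counts of true positives, false positives, and false negatives. A minimal sketch on made-up (label, prediction) pairs; the function name is hypothetical:

```python
def precision_recall(labels_and_predictions):
    """Return (precision, recall) for binary labels and predictions in {0.0, 1.0}."""
    tp = sum(1 for y, p in labels_and_predictions if y == 1.0 and p == 1.0)
    fp = sum(1 for y, p in labels_and_predictions if y == 0.0 and p == 1.0)
    fn = sum(1 for y, p in labels_and_predictions if y == 1.0 and p == 0.0)
    # Guard against empty denominators
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical pairs: 2 true positives, 1 false positive, 2 false negatives
toy_pairs = [(1.0, 1.0), (1.0, 1.0), (1.0, 0.0), (1.0, 0.0),
             (0.0, 1.0), (0.0, 0.0), (0.0, 0.0)]
toy_precision, toy_recall = precision_recall(toy_pairs)
print("Precision = %.2f, Recall = %.2f" % (toy_precision, toy_recall))
```

Sweeping the threshold and recomputing these two numbers at each step traces out the precision-recall curve.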

In [None]:
print("Area Under PR = %.2f" % metrics.areaUnderPR)

## Shut it Down

In [None]:
sc.stop()