<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center">Lecture for Week 7</h3>
<h3 style="text-align:center"> Spark Classification (Definitive 26)</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

## Classification Models in MLlib
Spark has several models available for performing binary and multiclass classification out of the box.
The following models are available for classification in Spark:
* Logistic regression
* Decision trees
* Random forests
* Gradient-boosted trees

Spark does not support making multilabel predictions natively. To train a multilabel model,
you must train one model per label and combine them manually. Once you have built such a model,
Spark’s built-in tools can evaluate it.
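The one-model-per-label approach can be sketched in plain Python. This is only an illustration of the idea, not Spark code: `fit_threshold`, `fit_multilabel`, and `predict_multilabel` are hypothetical names, and the toy threshold learner stands in for whatever binary classifier you would actually fit for each label.

```python
def fit_threshold(xs, ys):
    """Toy binary learner (hypothetical): threshold halfway between the class means.

    Assumes the positive class tends to have larger x values.
    """
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def fit_multilabel(rows, labels):
    """Train one independent binary model per label.

    rows is a list of (x, set_of_labels) pairs.
    """
    xs = [x for x, _ in rows]
    return {
        label: fit_threshold(xs, [1 if label in tags else 0 for _, tags in rows])
        for label in labels
    }


def predict_multilabel(models, x):
    """Combine the per-label binary predictions into one set of labels."""
    return {label for label, threshold in models.items() if x > threshold}


# Each row carries a single numeric feature and zero or more labels.
rows = [
    (5.0, set()),
    (15.0, set()),
    (25.0, {"warm"}),
    (35.0, {"warm", "hot"}),
    (40.0, {"warm", "hot"}),
]
models = fit_multilabel(rows, ["warm", "hot"])
```

Calling `predict_multilabel(models, 38.0)` returns both labels, while a value of 24.0 only crosses the "warm" threshold.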

Let’s start looking at the classification models by loading in some data:

In [None]:
bInput = spark.read.format("parquet").load("gs://info323-ya45-spring2020-bucket/data/binary-classification")\
  .selectExpr("features", "cast(label as double) as label")

In [None]:
bInput.printSchema()

In [None]:
bInput.show(3, False)

## LogisticRegression
Here’s a simple example using the LogisticRegression model. We don’t specify any parameters:
the defaults suffice because our data already uses the default column names (`features` and `label`).
In practice, you probably won’t need to change many of the parameters:

In [None]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()
print(lr.explainParams())  # see all parameters

In [None]:
lrModel = lr.fit(bInput)

Once the model is trained, you can get information about it by looking at the coefficients
and the intercept. The coefficients are the individual feature weights (each weight is multiplied
by its respective feature value, and the products are summed to compute the prediction), while the
intercept is the constant term added to that sum (if we chose to fit one when specifying the model).
Seeing the coefficients can be helpful for inspecting the model that you built and comparing how
features affect the prediction:

In [None]:
print(lrModel.coefficients)

In [None]:
print(lrModel.intercept)
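To make the role of the coefficients and the intercept concrete, here is a small plain-Python sketch of how a binary logistic regression turns them into a prediction. The weights, intercept, and feature values below are made-up numbers for illustration, not output from `lrModel`, and `predict_probability` is a hypothetical helper name.

```python
import math


def predict_probability(coefficients, intercept, features):
    """Probability of the positive class for one feature vector."""
    # Weighted sum of the features plus the intercept (the "margin").
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    # The sigmoid squashes the margin into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical values, chosen only for illustration.
coefficients = [0.5, -0.25, 1.0]
intercept = -1.0
features = [2.0, 2.0, 1.0]

probability = predict_probability(coefficients, intercept, features)
# Predict the positive class when the probability exceeds 0.5.
prediction = 1.0 if probability > 0.5 else 0.0
```

Here the margin is 0.5, so the probability is about 0.62 and the predicted label is 1.0. A positive coefficient pushes the prediction toward the positive class as its feature grows; a negative one pushes it away.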

## Model Summary
Logistic regression provides a model summary that gives you information about the final, trained
model.

In [None]:
summary = lrModel.summary
print(summary.areaUnderROC)

In [None]:
summary.roc.show()

In [None]:
summary.pr.show()
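The ROC curve is a list of (false positive rate, true positive rate) points, one per score threshold, and the area under it summarizes the curve in a single number. The following plain-Python sketch shows how such points and the area can be computed from scored examples. It is an illustration under simple assumptions (made-up scores, trapezoid rule), not the implementation Spark uses; `roc_points` and `area_under` are hypothetical names.

```python
def roc_points(scored):
    """Return ROC points for a list of (score, label) pairs with labels 0 or 1."""
    positives = sum(1 for _, label in scored if label == 1)
    negatives = len(scored) - positives
    ordered = sorted(scored, key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Lower the threshold past every example that shares this score.
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under(points):
    """Trapezoid-rule area under a curve given as (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up (score, label) pairs for illustration.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.2, 0)]
points = roc_points(scored)
auc = area_under(points)
```

For these five examples the area is 5/6: of the six (positive, negative) pairs, five are ranked in the right order.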

The objective history records the value of the objective (loss) function at each training
iteration, so it shows how quickly the model converged. We can access it through the model summary:

In [None]:
summary.objectiveHistory
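As a rough picture of what an objective history looks like, the sketch below trains a one-feature logistic regression in plain Python and records the loss before every update. Note the swapped-in technique: this uses plain gradient descent to keep the example short, whereas Spark's LogisticRegression relies on quasi-Newton optimizers. The data, step size, and function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weight, bias, xs, ys):
    """Average logistic loss over the data set."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(weight * x + bias)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def train(xs, ys, step=0.5, iterations=20):
    """Gradient descent that records the loss before each update."""
    weight, bias, history = 0.0, 0.0, []
    for _ in range(iterations):
        history.append(log_loss(weight, bias, xs, ys))
        errors = [sigmoid(weight * x + bias) - y for x, y in zip(xs, ys)]
        weight -= step * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        bias -= step * sum(errors) / len(xs)
    return weight, bias, history


# Small, non-separable toy data set (made up for illustration).
xs = [-2.0, -1.0, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1]
weight, bias, objective_history = train(xs, ys)
```

The history starts at ln 2 (the loss of an all-zero model) and shrinks with each iteration. A history that flattens early suggests the optimizer converged quickly; one that is still falling at the last iteration suggests more iterations may help.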

## Decision Tree
Here’s a minimal but complete example of using a decision tree classifier:

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier()
print(dt.explainParams())

In [None]:
dtModel = dt.fit(bInput)

In [None]:
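# Decision tree models do not expose a training summary the way logistic
# regression does, so inspect the learned tree structure instead.
print(dtModel.toDebugString)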

## Random Forest and Gradient-Boosted Trees
These methods are extensions of the decision tree. Rather than training one tree on all of the data, you
train multiple trees on varying subsets of the data. The intuition is that each decision tree becomes an
“expert” in one particular part of the problem, while other trees become experts in other parts.
By combining these various experts, you then get a “wisdom of the crowds” effect, where the group’s
performance exceeds that of any individual tree. In addition, these methods can help prevent overfitting.

In [None]:
from pyspark.ml.classification import RandomForestClassifier
rfClassifier = RandomForestClassifier()
print(rfClassifier.explainParams())

In [None]:
trainedModel = rfClassifier.fit(bInput)
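The "many trees on varying subsets, then vote" idea can be sketched in plain Python. This is a heavily simplified illustration, not Spark's algorithm: each "tree" is a hypothetical one-split stump on a single feature, the subsets are bootstrap samples, and the function names and data are made up.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Toy one-split 'tree': threshold halfway between the two class means.

    Assumes class 1 tends to have larger x values than class 0.
    """
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0


def fit_forest(rows, num_trees, seed):
    """Train each stump on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    thresholds = []
    while len(thresholds) < num_trees:
        # Sample with replacement, same size as the original data.
        sample = [rng.choice(rows) for _ in rows]
        # Redraw in the rare case a sample misses one of the classes.
        if len({y for _, y in sample}) == 2:
            thresholds.append(fit_stump(sample))
    return thresholds


def predict_forest(thresholds, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(1 if x > t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data: class 0 has small x values, class 1 has large ones.
rows = [(float(i), 0) for i in range(5)] + [(float(i), 1) for i in range(10, 15)]
forest = fit_forest(rows, num_trees=5, seed=42)
```

Each stump sees slightly different data and so learns a slightly different threshold, but the vote is stable: `predict_forest(forest, 2.0)` returns 0 and `predict_forest(forest, 12.0)` returns 1.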

## GBT

In [None]:
from pyspark.ml.classification import GBTClassifier
gbtClassifier = GBTClassifier()
print(gbtClassifier.explainParams())

In [None]:
trainedModel = gbtClassifier.fit(bInput)

## Naive Bayes
Naive Bayes classifiers are a collection of classifiers based on Bayes’ theorem. The core assumption
behind the models is that all features in your data are independent of one another, given the class label.

In [None]:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print(nb.explainParams())

In [None]:
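# The default (multinomial) Naive Bayes requires non-negative feature values;
# this filter presumably drops the rows that would violate that requirement.
trainedModel = nb.fit(bInput.where("label != 0"))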
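The independence assumption is what makes Naive Bayes cheap: the score for a class is just its log prior plus a sum of per-feature log likelihoods. The sketch below is a minimal plain-Python multinomial Naive Bayes with add-one smoothing on made-up count data. It illustrates the idea only and is not guaranteed to match Spark's implementation detail for detail; `fit_naive_bayes` and `predict` are hypothetical names.

```python
import math


def fit_naive_bayes(rows, smoothing=1.0):
    """Fit on rows of (feature_counts, label); counts must be non-negative."""
    labels = sorted({label for _, label in rows})
    num_features = len(rows[0][0])
    log_priors, log_likelihoods = {}, {}
    for label in labels:
        members = [counts for counts, y in rows if y == label]
        log_priors[label] = math.log(len(members) / len(rows))
        # Total count of each feature within this class.
        totals = [sum(counts[j] for counts in members) for j in range(num_features)]
        denominator = sum(totals) + smoothing * num_features
        log_likelihoods[label] = [
            math.log((total + smoothing) / denominator) for total in totals
        ]
    return log_priors, log_likelihoods


def predict(model, features):
    """Pick the class with the highest log posterior score."""
    log_priors, log_likelihoods = model
    scores = {
        label: log_priors[label]
        + sum(c * ll for c, ll in zip(features, log_likelihoods[label]))
        for label in log_priors
    }
    return max(scores, key=scores.get)


# Made-up count features (for example, word counts) with labels 0 and 1.
rows = [
    ([3, 0, 1], 0),
    ([2, 1, 0], 0),
    ([0, 3, 1], 1),
    ([1, 2, 2], 1),
]
model = fit_naive_bayes(rows)
```

With this data, `predict(model, [4, 0, 0])` returns 0 because the first feature is far more common in class 0, and `predict(model, [0, 4, 1])` returns 1.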

## Detailed Evaluation Metrics
MLlib also contains tools that let you evaluate multiple classification metrics at once. Unfortunately,
these metrics classes have not been ported over to Spark’s DataFrame-based ML package from the
underlying RDD framework. So, at the time of this writing, you still have to create an RDD to use
these. In the future, this functionality will likely be ported to DataFrames and the following may no
longer be the best way to see metrics (although you will still be able to use these APIs).
There are three different classification metrics we can use:
* Binary classification metrics
* Multiclass classification metrics
* Multilabel classification metrics

In [None]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
out = dtModel.transform(bInput)\
  .select("prediction", "label")\
  .rdd.map(lambda x: (float(x[0]), float(x[1])))

In [None]:
metrics = BinaryClassificationMetrics(out)

Once we’ve done that, we can read typical classification metrics from the metrics object using
an API similar to the one we saw with logistic regression:

In [None]:
print(metrics.areaUnderPR)

In [None]:
print(metrics.areaUnderROC)

In [None]:
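# The PySpark BinaryClassificationMetrics wrapper exposes only areaUnderPR and
# areaUnderROC (the roc curve method belongs to the Scala API), so show the
# curve from the logistic regression summary computed earlier instead.
print("Receiver Operating Characteristic")
summary.roc.show()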
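The same kind of (prediction, label) pairs that feed the metrics classes can also be summarized by hand. The following plain-Python sketch counts the confusion-matrix cells and derives accuracy, precision, and recall. The pairs are made-up values and `confusion_counts` is a hypothetical helper, shown only to clarify what these metrics measure.

```python
def confusion_counts(pairs):
    """Count confusion-matrix cells from (prediction, label) pairs of 0.0/1.0."""
    tp = sum(1 for p, y in pairs if p == 1.0 and y == 1.0)
    fp = sum(1 for p, y in pairs if p == 1.0 and y == 0.0)
    tn = sum(1 for p, y in pairs if p == 0.0 and y == 0.0)
    fn = sum(1 for p, y in pairs if p == 0.0 and y == 1.0)
    return tp, fp, tn, fn


# Made-up (prediction, label) pairs for illustration.
pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
tp, fp, tn, fn = confusion_counts(pairs)

accuracy = (tp + tn) / len(pairs)  # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted positives that are correct
recall = tp / (tp + fn)            # share of actual positives that were found
```

For these six pairs, accuracy, precision, and recall all come out to 2/3.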

