### One-vs-Rest classifier 

OneVsRest is implemented as an Estimator. It takes a base Classifier instance and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes. At prediction time, every binary classifier is evaluated and the index of the most confident one is output as the label.
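The reduction described above can be sketched in plain Python, independent of Spark. This is only an illustration of the idea, not Spark's implementation: the helper names `one_vs_rest_labels` and `pick_most_confident`, the toy labels, and the confidence scores are all hypothetical.

```python
def one_vs_rest_labels(labels, num_classes):
    """Build one binary label list per class: 1.0 where the label is i, else 0.0."""
    return [[1.0 if y == i else 0.0 for y in labels] for i in range(num_classes)]


def pick_most_confident(scores):
    """Return the index of the binary classifier with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical labels for five rows of a 3-class problem.
toy_labels = [0, 2, 1, 2, 0]
binary_problems = one_vs_rest_labels(toy_labels, num_classes=3)

# Binary targets for the "class 2 vs. rest" problem.
print(binary_problems[2])  # prints [0.0, 1.0, 0.0, 1.0, 0.0]

# Hypothetical confidences from the three binary classifiers for one row.
print(pick_most_confident([0.1, 0.7, 0.4]))  # prints 1
```

Each inner list would be the training target for one copy of the base classifier, and the final label is whichever copy scores highest.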

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName("one").getOrCreate()

In [5]:
df = spark.read.format("libsvm").load("newd/iris.txt")

In [6]:
# No seed is passed, so the split (and the metrics below) can vary between runs.
train, test = df.randomSplit([0.8, 0.2])

In [7]:
lr = LogisticRegression(maxIter=10, tol=1e-6, fitIntercept=True)

In [8]:
ovr = OneVsRest(classifier=lr)

In [9]:
ovrModel = ovr.fit(train)

In [10]:
predictions = ovrModel.transform(test)

In [11]:
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

In [12]:
accuracy = evaluator.evaluate(predictions)

In [13]:
print("test error=%g" % (1.0-accuracy))

test error=0.0769231


In [14]:
print(accuracy)

0.9230769230769231
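As a minimal sketch of what the accuracy metric measures, the snippet below computes the fraction of rows whose prediction equals the label, and the test error as its complement. The function name `fraction_correct` and the toy values are hypothetical and are not taken from the iris run.

```python
def fraction_correct(labels, predictions):
    """Accuracy: share of rows where the predicted label equals the true label."""
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)


# Hypothetical labels and predictions for four rows.
toy_labels = [0.0, 1.0, 2.0, 1.0]
toy_predictions = [0.0, 1.0, 1.0, 1.0]

toy_accuracy = fraction_correct(toy_labels, toy_predictions)
toy_error = 1.0 - toy_accuracy
print(toy_accuracy, toy_error)  # prints 0.75 0.25
```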


### Naive Bayes

Naive Bayes classifiers are a family of simple probabilistic multiclass classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between every pair of features.

Naive Bayes can be trained very efficiently. With a single pass over the training data, it computes the conditional probability distribution of each feature given each label. For prediction, it applies Bayes’ theorem to compute the conditional probability distribution of each label given an observation.
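The two steps above (one counting pass for training, Bayes’ theorem for prediction) can be sketched in plain Python for the multinomial case. This is an illustrative sketch under simplifying assumptions, not Spark's implementation: the function names, the toy count data, and the choice to leave the class prior unsmoothed are all assumptions made for the example.

```python
import math
from collections import defaultdict


def fit_multinomial_nb(rows, labels, smoothing=1.0):
    """Single pass over the data: count rows per class and sum feature counts per class."""
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_totals = defaultdict(lambda: [0.0] * n_features)
    for x, y in zip(rows, labels):
        class_counts[y] += 1
        for j, value in enumerate(x):
            feature_totals[y][j] += value

    model = {}
    for y, count in class_counts.items():
        total = sum(feature_totals[y])
        # Unsmoothed class prior (a simplification for this sketch).
        log_prior = math.log(count / len(labels))
        # Additive smoothing keeps unseen features from getting zero probability.
        log_theta = [
            math.log((feature_totals[y][j] + smoothing) / (total + smoothing * n_features))
            for j in range(n_features)
        ]
        model[y] = (log_prior, log_theta)
    return model


def predict_proba(model, x):
    """Bayes' theorem in log space, then normalize to a distribution over labels."""
    raw = {
        y: log_prior + sum(v * lt for v, lt in zip(x, log_theta))
        for y, (log_prior, log_theta) in model.items()
    }
    top = max(raw.values())  # subtract the max before exponentiating for stability
    unnormalized = {y: math.exp(score - top) for y, score in raw.items()}
    z = sum(unnormalized.values())
    return {y: u / z for y, u in unnormalized.items()}


# Hypothetical count features for four rows and two classes.
toy_rows = [[3, 0], [2, 1], [0, 3], [1, 2]]
toy_targets = [0, 0, 1, 1]
toy_model = fit_multinomial_nb(toy_rows, toy_targets)
toy_probs = predict_proba(toy_model, [2, 0])
print(toy_probs)
```

Working in log space matters because the per-feature terms add up to very large negative scores on high-dimensional inputs, and normalizing such scores tends to push the resulting probabilities toward 0 or 1.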

In [15]:
from pyspark.ml.classification import NaiveBayes

In [16]:
dt = spark.read.format("libsvm").load("newd/sample_libsvm.txt")

In [17]:
splits = dt.randomSplit([0.7, 0.3], seed=1234)

In [18]:
trains = splits[0]
tests = splits[1]

In [19]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

In [21]:
model = nb.fit(trains)

In [22]:
predictions = model.transform(tests)

In [24]:
predictions.show(4)

+-----+--------------------+--------------------+-----------+----------+
|label|            features|       rawPrediction|probability|prediction|
+-----+--------------------+--------------------+-----------+----------+
|  0.0|(692,[95,96,97,12...|[-173266.38465085...|  [1.0,0.0]|       0.0|
|  0.0|(692,[98,99,100,1...|[-176798.24796349...|  [1.0,0.0]|       0.0|
|  0.0|(692,[122,123,124...|[-189371.23080028...|  [1.0,0.0]|       0.0|
|  0.0|(692,[126,127,128...|[-210969.37526481...|  [1.0,0.0]|       0.0|
+-----+--------------------+--------------------+-----------+----------+
only showing top 4 rows



In [25]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

In [26]:
print(accuracy)

1.0
