# Spark - Classification - Naive Bayes Classifier ML Pipelines

## Algorithm Summary
- Task : Classification with binary or multiclass labels
- Input : Label (binary or multiclass, 0-based indexed), feature vectors (discrete)
- Smoothing : Additive smoothing, default parameter is set to 1.0
- Model type : Multinomial (default) or Bernoulli. To use Bernoulli, convert feature vectors to 0-1 vectors and set modelType to "bernoulli"
- Assumptions:
    - Conditional independence between every pair of features, given the class
    - Feature values are nonnegative, such as counts
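
As a rough illustration of what additive smoothing does in the multinomial model, the sketch below estimates feature probabilities for one class with the textbook formula (count + smoothing) / (class total + smoothing × number of features). The counts and the helper name `smoothed_feature_probs` are hypothetical and chosen only for illustration; Spark's own implementation works in log space and is not reproduced here.

```python
from fractions import Fraction

def smoothed_feature_probs(feature_counts, smoothing=1):
    """Textbook multinomial estimate: (count + a) / (total + a * n_features)."""
    total = sum(feature_counts)
    n_features = len(feature_counts)
    denom = total + smoothing * n_features
    return [Fraction(c + smoothing, denom) for c in feature_counts]

# Hypothetical summed feature counts for one class (not taken from the iris data).
counts_for_class = [3, 0, 1]

unsmoothed = smoothed_feature_probs(counts_for_class, smoothing=0)
smoothed = smoothed_feature_probs(counts_for_class, smoothing=1)

print(unsmoothed)  # the second feature gets probability 0
print(smoothed)    # every feature gets a nonzero probability
```

Without smoothing, a feature never seen with a class gets probability zero, which zeroes out the whole product for that class. A positive smoothing value avoids this.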


## Data Analysis Example
- <a href="https://github.com/vincentarelbundock/Rdatasets/blob/master/csv/datasets/iris.csv">Iris dataset</a>
- Connect to the Spark cluster
- Dataset Review
- Load Data & Data preprocessing
- Explore the data
- Create a multiclass naive Bayes Classifier and Evaluation
- Experimenting with Various Smoothing Parameters

## Dataset Review
The dataset contains three species of iris (Setosa, Versicolor, and Virginica) with 50 instances of each. In this example, we try to predict the species from the four features listed below.

Feature Information:
1. Sepal Length in cm
2. Sepal Width in cm
3. Petal Length in cm
4. Petal Width in cm

Target and Label :
- Species
    - Setosa
    - Versicolor
    - Virginica
    
## References :
<a href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/3783546674231736/4413065072037724/latest.html">Naive Bayes Classifier - ML Pipelines</a>


In [1]:
# Import the PySpark libraries
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.ml.feature import StringIndexer

In [2]:
#get some context.
#create a SparkContext and a SQLContext to use
conf = SparkConf()
conf.setMaster("spark://sparklab-master:7077")
conf.setAppName("Spark Classification with Naive Bayes - Iris Datasets")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext=SQLContext(sc)

## Load Data and Data Preprocessing

In [3]:
data_file = "/home/twster/Spark/Projects/datasets/iris2.csv"

We use a dataset in CSV format and load it with the SQLContext reader.

In [4]:
# Create a DataFrame from the CSV file
# (the first row is the header; column types are inferred)
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load(data_file)
data.show()

+-----------+----------+-----------+----------+-------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|Species|
+-----------+----------+-----------+----------+-------+
|        5.1|       3.5|        1.4|       0.2| setosa|
|        4.9|       3.0|        1.4|       0.2| setosa|
|        4.7|       3.2|        1.3|       0.2| setosa|
|        4.6|       3.1|        1.5|       0.2| setosa|
|        5.0|       3.6|        1.4|       0.2| setosa|
|        5.4|       3.9|        1.7|       0.4| setosa|
|        4.6|       3.4|        1.4|       0.3| setosa|
|        5.0|       3.4|        1.5|       0.2| setosa|
|        4.4|       2.9|        1.4|       0.2| setosa|
|        4.9|       3.1|        1.5|       0.1| setosa|
|        5.4|       3.7|        1.5|       0.2| setosa|
|        4.8|       3.4|        1.6|       0.2| setosa|
|        4.8|       3.0|        1.4|       0.1| setosa|
|        4.3|       3.0|        1.1|       0.1| setosa|
|        5.8|       4.0|        1.2|       0.2| 

## Explore data

In [5]:
# Convert target into numerical categories
labelIndexer = StringIndexer(inputCol="Species", outputCol="label")
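
By default, `StringIndexer` assigns indices by descending label frequency, so the numeric label does not follow alphabetical order. Because the indexer is fit inside the pipeline, the frequencies come from the training data. The pure-Python sketch below mimics that ordering on a hypothetical label column. The helper `frequency_index` and its alphabetical tie-breaking rule are assumptions of this sketch, not Spark code.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically; that tie rule is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical label column with unequal class frequencies.
species = ["virginica"] * 4 + ["versicolor"] * 3 + ["setosa"] * 2
mapping = frequency_index(species)
print(mapping)
```

This is consistent with the setosa rows appearing with label 2.0 in the prediction output further down: after the random split, setosa is apparently the least frequent class in the training set.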

Select only the SepalLength and Species columns:

In [6]:
data.select("SepalLength","Species").show()

+-----------+-------+
|SepalLength|Species|
+-----------+-------+
|        5.1| setosa|
|        4.9| setosa|
|        4.7| setosa|
|        4.6| setosa|
|        5.0| setosa|
|        5.4| setosa|
|        4.6| setosa|
|        5.0| setosa|
|        4.4| setosa|
|        4.9| setosa|
|        5.4| setosa|
|        4.8| setosa|
|        4.8| setosa|
|        4.3| setosa|
|        5.8| setosa|
|        5.7| setosa|
|        5.4| setosa|
|        5.1| setosa|
|        5.7| setosa|
|        5.1| setosa|
+-----------+-------+
only showing top 20 rows



In [7]:
# Take the first row of the iris2 dataset as a Row object
data.take(1)

[Row(SepalLength=5.1, SepalWidth=3.5, PetalLength=1.4, PetalWidth=0.2, Species=u'setosa')]

In [8]:
# Split the data into training and test sets.
# We randomly split the dataset into a training set (about 70%) and a test set (about 30%).
trainData, testData = data.randomSplit([0.7, 0.3], seed = 100)

In [9]:
trainData.show(10)

+-----------+----------+-----------+----------+-------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|Species|
+-----------+----------+-----------+----------+-------+
|        4.3|       3.0|        1.1|       0.1| setosa|
|        4.4|       2.9|        1.4|       0.2| setosa|
|        4.4|       3.0|        1.3|       0.2| setosa|
|        4.4|       3.2|        1.3|       0.2| setosa|
|        4.6|       3.1|        1.5|       0.2| setosa|
|        4.6|       3.2|        1.4|       0.2| setosa|
|        4.7|       3.2|        1.6|       0.2| setosa|
|        4.8|       3.0|        1.4|       0.1| setosa|
|        4.8|       3.0|        1.4|       0.3| setosa|
|        4.8|       3.1|        1.6|       0.2| setosa|
+-----------+----------+-----------+----------+-------+
only showing top 10 rows



In [10]:
testData.show(10)

+-----------+----------+-----------+----------+-------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|Species|
+-----------+----------+-----------+----------+-------+
|        4.5|       2.3|        1.3|       0.3| setosa|
|        4.6|       3.4|        1.4|       0.3| setosa|
|        4.6|       3.6|        1.0|       0.2| setosa|
|        4.7|       3.2|        1.3|       0.2| setosa|
|        4.8|       3.4|        1.9|       0.2| setosa|
|        4.9|       3.0|        1.4|       0.2| setosa|
|        4.9|       3.1|        1.5|       0.1| setosa|
|        4.9|       3.6|        1.4|       0.1| setosa|
|        5.0|       3.2|        1.2|       0.2| setosa|
|        5.0|       3.3|        1.4|       0.2| setosa|
+-----------+----------+-----------+----------+-------+
only showing top 10 rows



In [11]:
print("Counting Training Data : {}".format(trainData.count()))

Counting Training Data : 103


In [12]:
print("Counting Test Data : {}".format(testData.count()))

Counting Test Data : 47
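
The counts above are 103 and 47 rather than exactly 105 and 45 because the split is random per row: the weights are target proportions, not exact quotas. The sketch below shows the idea in plain Python, assuming each row is sent to the training set independently with probability 0.7. It is a simplified stand-in for `randomSplit`, not Spark's sampling code, and it will not reproduce Spark's exact split for the same seed.

```python
import random

def random_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(150))  # stand-in for the 150 iris rows
train, test = random_split(rows, train_weight=0.7, seed=100)

# Sizes are close to, but not necessarily equal to, 105 and 45.
print("train={} test={}".format(len(train), len(test)))
```

Fixing the seed makes the split repeatable, which is why the notebook passes `seed = 100`.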


In [13]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

In [14]:
vecAssembler = VectorAssembler(inputCols=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"], outputCol="features")

In [15]:
#Train a NaiveBayes Model
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# Chain labelIndexer, vecAssembler, and the NaiveBayes estimator in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, vecAssembler, nb])

#Run stages in pipeline and train model
model = pipeline.fit(trainData)

In [16]:
# Make predictions on testData so we can measure the accuracy of our model on new data
predictions = model.transform(testData)
print("Show the predictions")
predictions.select("label","prediction","probability").show()

Show the predictions
+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|  2.0|       2.0|[0.29698237383357...|
|  2.0|       2.0|[0.21798511589122...|
|  2.0|       2.0|[0.15141773405828...|
|  2.0|       2.0|[0.20436085847774...|
|  2.0|       2.0|[0.24788699074567...|
|  2.0|       2.0|[0.22533897917645...|
|  2.0|       2.0|[0.21072257132948...|
|  2.0|       2.0|[0.16530848479213...|
|  2.0|       2.0|[0.18841991732741...|
|  2.0|       2.0|[0.20023264276145...|
|  2.0|       2.0|[0.28149718404605...|
|  2.0|       2.0|[0.20037174080684...|
|  2.0|       2.0|[0.18351954523753...|
|  2.0|       2.0|[0.24433103432406...|
|  2.0|       2.0|[0.18864884421606...|
|  2.0|       2.0|[0.17451529522505...|
|  2.0|       2.0|[0.17250580082906...|
|  0.0|       0.0|[0.49917720827077...|
|  2.0|       2.0|[0.16640704731458...|
|  0.0|       0.0|[0.50561466273043...|
+-----+----------+--------------------+
only showing top 20
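
The `probability` column holds one value per class, and `prediction` is the index of the largest value. Assuming the usual naive Bayes formulation, in which each class gets a log score (log prior plus log likelihood), the sketch below shows how such scores can be normalized into probabilities that sum to 1. The scores are hypothetical and `normalize_log_scores` is an illustrative helper, not a Spark function.

```python
import math

def normalize_log_scores(log_scores):
    """Turn per-class log scores into probabilities that sum to 1 (log-sum-exp)."""
    peak = max(log_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical log scores for three classes.
log_scores = [-12.0, -11.0, -10.5]
probs = normalize_log_scores(log_scores)
predicted = probs.index(max(probs))

print("probabilities = {}".format(probs))
print("predicted class index = {}".format(predicted))
```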

In [17]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", 
                                              predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Model Accuracy : {}".format(accuracy))

Model Accuracy : 0.936170212766


In [18]:
evaluator.explainParam("metricName")

'metricName: metric name in evaluation (f1|weightedPrecision|weightedRecall|accuracy) (default: f1, current: accuracy)'

In [19]:
from pyspark.mllib.evaluation import MulticlassMetrics
# Create (prediction, label) pairs
predictionAndLabel = predictions.select("prediction", "label").rdd

# Generate confusion matrix
metrics = MulticlassMetrics(predictionAndLabel)
print(metrics.confusionMatrix())

DenseMatrix([[ 13.,   0.,   0.],
             [  3.,  12.,   0.],
             [  0.,   0.,  19.]])
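
The accuracy reported earlier can be recovered from this matrix, along with per-class recall and precision. The sketch below does the arithmetic in plain Python, assuming rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above.
# Assumption: rows are true labels and columns are predicted labels.
confusion = [
    [13, 0, 0],
    [3, 12, 0],
    [0, 0, 19],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))
accuracy = correct / float(total)

# Recall: correct predictions divided by the row (true class) total.
recall = [confusion[i][i] / float(sum(confusion[i])) for i in range(n_classes)]
# Precision: correct predictions divided by the column (predicted class) total.
precision = [
    confusion[i][i] / float(sum(row[i] for row in confusion))
    for i in range(n_classes)
]

print("accuracy  = {:.4f}".format(accuracy))
print("recall    = {}".format(recall))
print("precision = {}".format(precision))
```

All three errors are examples of class 1 predicted as class 0, so class 1 has a recall of 0.8 and class 0 has a precision of 0.8125.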


## Experimenting with Various Smoothing Parameters

In [20]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ParamGrid and Evaluator for Cross Validation
paramGrid = ParamGridBuilder().addGrid(nb.smoothing, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="accuracy")

In [21]:
# Run Cross-validation
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator)
cvModel = cv.fit(trainData)
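
To get a feel for how much work this cross-validation run does, the sketch below counts pipeline fits and shows how k-fold splitting partitions the training rows. It assumes `numFolds` is left at its default of 3. It also uses contiguous folds for simplicity, whereas Spark assigns rows to folds randomly. `k_fold_indices` is an illustrative helper, not part of PySpark.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, validation_idx) pairs, one per fold, using contiguous folds."""
    fold_of = [i * n_folds // n_rows for i in range(n_rows)]
    for fold in range(n_folds):
        train_idx = [i for i in range(n_rows) if fold_of[i] != fold]
        valid_idx = [i for i in range(n_rows) if fold_of[i] == fold]
        yield train_idx, valid_idx

smoothing_grid = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
n_folds = 3         # assumed: CrossValidator's default numFolds
n_train_rows = 103  # size of trainData reported earlier

folds = list(k_fold_indices(n_train_rows, n_folds))
fits_during_search = len(smoothing_grid) * n_folds
total_fits = fits_during_search + 1  # plus one refit of the best setting on all of trainData

print("validation fold sizes: {}".format([len(v) for _, v in folds]))
print("pipeline fits: {}".format(total_fits))
```

Each smoothing value is scored by its average accuracy across the folds, and the best one is refit on the full training set.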

In [22]:
# Make predictions on testData. cvModel.transform uses the best model found during cross-validation.
cvPredictions = cvModel.transform(testData)

In [23]:
#select results to view
cvPredictions.select("label", "prediction", "probability").show()

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|  2.0|       2.0|[0.30030694251837...|
|  2.0|       2.0|[0.22025422958423...|
|  2.0|       2.0|[0.15149173803482...|
|  2.0|       2.0|[0.20518906865455...|
|  2.0|       2.0|[0.24956972908126...|
|  2.0|       2.0|[0.22644817576129...|
|  2.0|       2.0|[0.21050751177550...|
|  2.0|       2.0|[0.16461564455715...|
|  2.0|       2.0|[0.18893524268040...|
|  2.0|       2.0|[0.20098739805581...|
|  2.0|       2.0|[0.28730389628984...|
|  2.0|       2.0|[0.20115179465289...|
|  2.0|       2.0|[0.18403835547291...|
|  2.0|       2.0|[0.24841050465935...|
|  2.0|       2.0|[0.18921187643046...|
|  2.0|       2.0|[0.17492305771120...|
|  2.0|       2.0|[0.17286890844186...|
|  0.0|       0.0|[0.50177370035199...|
|  2.0|       2.0|[0.16659681368935...|
|  0.0|       0.0|[0.50958191840212...|
+-----+----------+--------------------+
only showing top 20 rows



In [24]:
# Evaluate bestModel found from Cross Validation
evaluator.evaluate(cvPredictions)

0.9574468085106383
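
With only 47 test rows, it helps to translate the two accuracy figures back into counts of correct predictions. The sketch below does that arithmetic using the values printed in this notebook.

```python
n_test = 47  # size of testData reported earlier

baseline_accuracy = 0.936170212766   # pipeline with smoothing=1.0
tuned_accuracy = 0.9574468085106383  # best model from cross-validation

baseline_correct = int(round(baseline_accuracy * n_test))
tuned_correct = int(round(tuned_accuracy * n_test))

print("correct before tuning: {} of {}".format(baseline_correct, n_test))
print("correct after tuning:  {} of {}".format(tuned_correct, n_test))
print("additional correct predictions: {}".format(tuned_correct - baseline_correct))
```

The improvement corresponds to a single additional correct prediction, so it should be treated with caution on a test set this small.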