
# Spark - MLlib - Classification - Naive Bayes
**Description:** 
*   Based on Bayes' theorem;
*   The probability of an event A, `P(A)`, is between 0 and 1;
*   The conditional probability of A given B is `P(A|B) = P(A intersect B) / P(B) = P(B|A) * P(A) / P(B)`;
*   The `target variable` becomes the `Event A`;
*   The model stores, for each value of the target variable, the conditional probability of each possible value of the predictor variables;
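
As a quick check of Bayes' theorem, the sketch below computes the probability that a message is spam given that it contains the word "free". All counts are hypothetical and invented for illustration; they are not taken from the SMS dataset used later.

```python
# Hypothetical counts, invented for illustration only.
n_messages = 1000
n_spam = 200             # messages labelled spam (event A)
n_free = 150             # messages containing the word "free" (event B)
n_spam_and_free = 120    # messages that are spam AND contain "free"

p_spam = n_spam / n_messages                    # P(A)
p_free = n_free / n_messages                    # P(B)
p_free_given_spam = n_spam_and_free / n_spam    # P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_free = p_free_given_spam * p_spam / p_free

# Equivalent definition: P(A|B) = P(A intersect B) / P(B)
p_direct = (n_spam_and_free / n_messages) / p_free

print(round(p_spam_given_free, 3), round(p_direct, 3))
```

Both routes give the same value, which is the point of the identity: the posterior can be computed from quantities that are easy to count in training data.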

**Advantages:**
*   Fast and simple;
*   Works well even with missing values;
*   Provides the probability of an outcome;
*   Excellent for categorical variables;

**Disadvantages:**
*   Doesn't work well with many numeric variables;
*   Assumes that the predictor variables are independent of each other given the class;
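
The independence assumption is what makes the model "naive": the probability of a whole message is taken as the product of per-word probabilities. The following is a minimal pure-Python sketch of a multinomial Naive Bayes scorer with Laplace smoothing, assuming a tiny made-up training set. The names `train`, `log_posterior` and `classify` are hypothetical, and this is not how Spark implements it internally.

```python
import math
from collections import Counter

# Tiny made-up training set (hypothetical, not taken from the SMS dataset).
train = [
    ("ham",  "see you at lunch"),
    ("ham",  "call me at home"),
    ("spam", "free prize call now"),
    ("spam", "win a free prize"),
]

vocab = {w for _, text in train for w in text.split()}
class_counts = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in class_counts}
for label, text in train:
    word_counts[label].update(text.split())

def log_posterior(text, label, smoothing=1.0):
    """Unnormalised log P(label | text) under the independence assumption."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))  # log prior
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen in training
        # Laplace-smoothed P(word | label). The terms are simply summed in
        # log space because words are treated as independent given the class.
        p = (word_counts[label][word] + smoothing) / (total_words + smoothing * len(vocab))
        score += math.log(p)
    return score

def classify(text):
    # Pick the class with the highest unnormalised log posterior.
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(classify("free prize now"))
print(classify("see you at home"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a single unseen word from forcing a class probability to zero.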

**Applications:**
*   Spam filter;
*   Medical diagnosis;
*   Document classification;
    




# Spark Setup

In [None]:
!apt-get update

In [3]:
# Install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [5]:
# Environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [6]:
# Make pyspark "importable"
import findspark
# Use the absolute path set in SPARK_HOME so this works from any working directory
findspark.init(os.environ["SPARK_HOME"])

In [7]:
# Libraries and Context Setup
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

In [8]:
# Spark configuration: serve the Spark UI on port 4050
conf = SparkConf().set("spark.ui.port", "4050")

# Create the Spark context
sc = pyspark.SparkContext(conf=conf)

# Create the Spark session (it reuses the context created above)
spark = SparkSession.builder.master('local').appName('spark_ml_lib').getOrCreate()

# Create the SQL context
sqlContext = pyspark.SQLContext(sc)

# Spam classification
Dataset: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

In [10]:
# Import libraries
from pyspark.ml                import Pipeline
from pyspark.ml.feature        import IDF, Tokenizer, HashingTF
from pyspark.ml.classification import NaiveBayes, NaiveBayesModel
from pyspark.ml.evaluation     import MulticlassClassificationEvaluator

In [11]:
# Spark session (getOrCreate() returns the session already created during setup)
sp_session = SparkSession.builder.master('local').appName('spark_mllib').getOrCreate()

In [12]:
# Load data
rdd_spam = sc.textFile('/content/drive/My Drive/Colab Notebooks/08-apache-spark/data/mllib/SMSSpamCollection.csv')

In [13]:
rdd_spam.cache()

/content/drive/My Drive/Colab Notebooks/08-apache-spark/data/mllib/SMSSpamCollection.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [14]:
rdd_spam.collect()[:10]

['ham,Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...,,,,,,,,,',
 'ham,Ok lar... Joking wif u oni...,,,,,,,,,,',
 'ham,U dun say so early hor... U c already then say...,,,,,,,,,,',
 "ham,Nah I don't think he goes to usf, he lives around here though,,,,,,,,,",
 'ham,Even my brother is not like to speak with me. They treat me like aids patent.,,,,,,,,,,',
 "ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,,,,,,,,,,",
 "ham,I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.,,,,,,,,,",
 "ham,I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.,,,,,,,,,,",
 'ham,I HAVE A DATE ON SUNDAY WITH WILL!!,,,,,,,,,,',
 "ham

In [15]:
rdd_spam.count()

1000

# Preprocessing data

In [16]:
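def transform_to_vector(input_str):
  # Split the raw CSV line on every comma
  att_list = input_str.split(',')
  # Encode the label: ham -> 0.0, anything else (spam) -> 1.0
  sms_type = 0.0 if att_list[0] == 'ham' else 1.0
  # att_list[1] is the message text up to its first comma
  return [sms_type,att_list[1]]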
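
Because `transform_to_vector` splits on every comma, a message such as "Nah I don't think he goes to usf, he lives around here though" is cut at its first comma. A possible alternative is sketched below, assuming each line is `label,message` followed only by a run of empty CSV columns. The name `parse_sms_line` is hypothetical, and switching to it would change the results shown in the rest of this notebook.

```python
def parse_sms_line(line):
    """Hypothetical parser that keeps the full message text."""
    # Split only on the first comma: everything before it is the label.
    label_str, _, rest = line.partition(',')
    # Drop the run of trailing commas left by the empty CSV columns.
    message = rest.rstrip(',')
    label = 0.0 if label_str == 'ham' else 1.0
    return [label, message]

sample = "ham,Nah I don't think he goes to usf, he lives around here though,,,,,,,,,"
print(parse_sms_line(sample))
```

One limitation of this sketch: a message that genuinely ends with a comma loses it, and quoted CSV fields are not handled.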

In [17]:
# Apply function
rdd_spam2 = rdd_spam.map(transform_to_vector)

In [18]:
# Transform to dataframe
df_spam = sp_session.createDataFrame(rdd_spam2,['label','message'])

In [19]:
df_spam.cache()

DataFrame[label: double, message: string]

In [21]:
df_spam.select('label','message').show()

+-----+--------------------+
|label|             message|
+-----+--------------------+
|  0.0|Go until jurong p...|
|  0.0|Ok lar... Joking ...|
|  0.0|U dun say so earl...|
|  0.0|Nah I don't think...|
|  0.0|Even my brother i...|
|  0.0|As per your reque...|
|  0.0|I'm gonna be home...|
|  0.0|I've been searchi...|
|  0.0|I HAVE A DATE ON ...|
|  0.0|Oh k...i'm watchi...|
|  0.0|Eh u remember how...|
|  0.0|Fine if thats th...|
|  0.0|Is that seriously...|
|  0.0|I‘m going to try ...|
|  0.0|So ü pay first la...|
|  0.0|Aft i finish my l...|
|  0.0|Ffffffffff. Alrig...|
|  0.0|Just forced mysel...|
|  0.0|Lol your always s...|
|  0.0|Did you catch the...|
+-----+--------------------+
only showing top 20 rows



In [25]:
# Split into words, then compute TF-IDF (TF = term frequency, IDF = inverse document frequency)
tokenizer = Tokenizer(inputCol='message',outputCol='words')
hashing_tf = HashingTF(inputCol=tokenizer.getOutputCol(),outputCol='temp_features')
idf = IDF(inputCol=hashing_tf.getOutputCol(),outputCol='features')
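
To make the three stages concrete, here is a rough pure-Python sketch of what they compute: tokenize, hash each word into a fixed number of buckets and count (TF), then down-weight buckets that occur in many documents (IDF). It is an illustration under assumptions, not Spark's implementation: `zlib.crc32` stands in for the hash Spark uses (MurmurHash3), `NUM_FEATURES` is tiny on purpose (Spark's `HashingTF` defaults to 2**18 buckets), and the three documents are made up.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # deliberately tiny for the sketch

def tokenize(text):
    # Like Tokenizer: lowercase, then split on whitespace.
    return text.lower().split()

def hashed_tf(words, num_features=NUM_FEATURES):
    # Map each word to a bucket index and count occurrences per bucket.
    # zlib.crc32 is a stand-in hash chosen for this sketch.
    return Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)

def idf_weights(tf_vectors):
    # Smoothed inverse document frequency: log((m + 1) / (df + 1)),
    # where m is the number of documents and df the number of documents
    # in which the bucket is non-zero.
    m = len(tf_vectors)
    df = Counter(bucket for tf in tf_vectors for bucket in tf)
    return {bucket: math.log((m + 1) / (count + 1)) for bucket, count in df.items()}

docs = ["Free prize call now", "call me later", "see you later"]  # made-up messages
tf_vectors = [hashed_tf(tokenize(d)) for d in docs]
idf_by_bucket = idf_weights(tf_vectors)
tfidf = [{b: tf[b] * idf_by_bucket[b] for b in tf} for tf in tf_vectors]
print(tfidf[0])
```

Hashing avoids building a vocabulary up front, at the cost of occasional collisions where two different words share a bucket; a larger number of buckets makes collisions rarer.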

# Machine Learning

In [26]:
# Split data randomly into training (~70%) and test (~30%) sets
(train_data,test_data) = df_spam.randomSplit([0.7,0.3])

In [27]:
train_data.count()

699

In [28]:
test_data.count()

301

In [29]:
# Model creation
nb_classifier = NaiveBayes()

In [30]:
# Pipeline creation
pipeline = Pipeline(stages=[tokenizer,hashing_tf,idf,nb_classifier])

In [31]:
# Model training with pipeline
model = pipeline.fit(train_data)

In [38]:
# Predicting with test data
prediction = model.transform(test_data)
prediction.select('prediction','label').collect()[:10]

[Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=1.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=1.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0)]

In [39]:
# Accuracy evaluation
evaluation = MulticlassClassificationEvaluator(predictionCol='prediction',labelCol='label',metricName='accuracy')

In [40]:
evaluation.evaluate(prediction)

0.9235880398671097

In [41]:
# Confusion matrix
prediction.groupBy('label','prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  1.0|       1.0|  156|
|  0.0|       1.0|   16|
|  1.0|       0.0|    7|
|  0.0|       0.0|  122|
+-----+----------+-----+
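
Accuracy alone hides how the errors are distributed between the two classes. The sketch below derives accuracy, precision, recall and F1 for the spam class (label `1.0`) directly from the four counts in the confusion matrix above; the variable names are chosen for this illustration only.

```python
# Counts copied from the confusion matrix above (label 1.0 = spam, 0.0 = ham).
tp = 156  # label 1.0, prediction 1.0: spam correctly flagged
fp = 16   # label 0.0, prediction 1.0: ham wrongly flagged as spam
fn = 7    # label 1.0, prediction 0.0: spam that slipped through
tn = 122  # label 0.0, prediction 0.0: ham correctly passed

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.9236 precision=0.9070 recall=0.9571 f1=0.9313
```

The accuracy matches the value returned by the evaluator, and the lower precision shows that most of the errors are ham messages flagged as spam.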

