# <font color='blue'>Data Science Academy Big Data Real-Time Analytics with Python and Spark</font>

# <font color='blue'>Chapter 9</font>

## <font color='blue'>Spark MLlib - Classification - Naive Bayes</font>

<strong> Description </strong>
<ul style="list-style-type:square">
  <li>Based on Bayes' theorem.</li>
  <li>The probability of an event A, P(A), lies between 0 and 1.</li>
  <li>The conditional probability is P(A|B) = P(A ∩ B) / P(B) = P(B|A) * P(A) / P(B).</li>
  <li>The target variable becomes event A (the observed predictor values play the role of B).</li>
  <li>The model stores the conditional probability of the target variable for each possible value of the predictor variables.</li>
</ul>
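
As a small numeric illustration of Bayes' theorem, the sketch below computes P(spam | word) from made-up probabilities. The numbers and the helper name `bayes_posterior` are assumptions for illustration only; they are not taken from the SMS dataset.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0.0 < p_b <= 1.0:
        raise ValueError("P(B) must be in (0, 1]")
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 20% of messages are spam; the word "free" appears
# in 50% of spam messages and in 5% of ham messages.
p_spam = 0.2
p_free_given_spam = 0.5
p_free_given_ham = 0.05

# The law of total probability gives P("free").
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = bayes_posterior(p_free_given_spam, p_spam, p_free)
print(round(p_spam_given_free, 4))
```

With these assumed inputs, seeing the word raises the spam probability from the 20% prior to roughly 71%.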

<dl>
  <dt>Advantages</dt>
  <dd>- Fast and simple</dd>
  <dd>- Works well even with missing values</dd>
  <dd>- Provides probabilities for each outcome</dd>
  <dd>- Excellent with categorical variables</dd>
  <br />
  <dt>Disadvantages</dt>
  <dd>- Does not work well with many numeric variables</dd>
  <dd>- Assumes the predictor variables are independent of each other given the class</dd>
  <br />
  <dt>Applications</dt>
  <dd>- Spam filtering</dd>
  <dd>- Medical diagnosis</dd>
  <dd>- Document classification</dd>
</dl>
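
To make the stored conditional probabilities and the independence assumption concrete, here is a minimal from-scratch sketch on a toy word dataset. The data, the function names `train_naive_bayes` and `predict`, and the add-one (Laplace) smoothing value are assumptions for illustration; this is not Spark's implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, smoothing=1.0):
    """docs is a list of (label, words). Returns (log priors, log likelihoods)."""
    label_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        word_counts[label].update(words)
        vocab.update(words)

    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_likelihood = {}
    for c in label_counts:
        # Smoothed denominator: words seen in class c plus one pseudo-count per vocab word.
        total = sum(word_counts[c].values()) + smoothing * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + smoothing) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(model, words):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log probabilities are simply summed.
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in words if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy, made-up training data.
toy = [
    ("spam", ["free", "prize", "call"]),
    ("spam", ["free", "call", "now"]),
    ("ham", ["see", "you", "now"]),
    ("ham", ["call", "me", "later"]),
]
model = train_naive_bayes(toy)
print(predict(model, ["free", "prize"]))
print(predict(model, ["see", "you", "later"]))
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on long messages.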

## Spam Classification

https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [1]:
from pyspark.ml.classification import NaiveBayes, NaiveBayesModel
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.feature import IDF
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [2]:
# Credentials for connecting to Hadoop
# @hidden_cell
credentials_1 = {
  'auth_url':'https://identity.open.softlayer.com',
  'project':'object_storage_f0d6ce32_5e0f_4bc0_8812_229b8d429dbe',
  'project_id':'9a0cc60102244d368e96a83f25d4ca89',
  'region':'dallas',
  'user_id':'0caf8026c98a4342ac027a05416e6dee',
  'domain_id':'3be46074545f4c09b1f10df3ace95998',
  'domain_name':'1351407',
  'username':'member_327b95c3eecf105b8bdb0125b81968cfcc557dbd',
  'password':"""D[Cvr1bgf9DM^I{C""",
  'container':'CursoSpark',
  'tenantId':'undefined',
  'filename':'SMSSpamCollection.csv'
}


In [3]:

from pyspark.sql import SparkSession

# @hidden_cell
# This function is used to setup the access of Spark to your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def set_hadoop_config_with_credentials_f0d6ce325e0f4bc08812229b8d429dbe(name):
    """This function sets the Hadoop configuration so it is possible to
    access data from Bluemix Object Storage using Spark"""

    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', '9a0cc60102244d368e96a83f25d4ca89')
    hconf.set(prefix + '.username', '0caf8026c98a4342ac027a05416e6dee')
    hconf.set(prefix + '.password', 'D[Cvr1bgf9DM^I{C')
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', False)

# you can choose any name
name = 'keystone'
set_hadoop_config_with_credentials_f0d6ce325e0f4bc08812229b8d429dbe(name)

spark = SparkSession.builder.getOrCreate()

df_data_1 = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load('swift://CursoSpark.' + name + '/SMSSpamCollection.csv')



In [4]:
# Load the data and create an RDD
spamRDD = sc.textFile('swift://CursoSpark.' + name + '/SMSSpamCollection.csv', 2)

In [5]:
spamRDD.cache()

swift://CursoSpark.keystone/SMSSpamCollection.csv MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:-2

In [6]:
spamRDD.collect()

[u'ham,Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...,,,,,,,,,',
 u'ham,Ok lar... Joking wif u oni...,,,,,,,,,,',
 u'ham,U dun say so early hor... U c already then say...,,,,,,,,,,',
 u"ham,Nah I don't think he goes to usf, he lives around here though,,,,,,,,,",
 u'ham,Even my brother is not like to speak with me. They treat me like aids patent.,,,,,,,,,,',
 u"ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,,,,,,,,,,",
 u"ham,I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.,,,,,,,,,",
 u"ham,I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.,,,,,,,,,,",
 u'ham,I HAVE A DATE ON SUNDAY WITH WILL!!,,,,,,,,,

## Data Preprocessing

In [7]:
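def TransformToVector(inputStr):
    # Field 0 is the class; field 1 is the message text up to its first comma
    attList = inputStr.split(",")
    # Encode the label as a double: ham -> 0.0, anything else (spam) -> 1.0
    smsType = 0.0 if attList[0] == "ham" else 1.0
    return [smsType, attList[1]]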
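
Because `TransformToVector` splits on every comma and returns `attList[1]`, a message that itself contains commas is cut at its first comma. A possible alternative, sketched below, splits only on the first comma and drops the trailing empty columns. The name `parse_sms_line` is hypothetical, and treating trailing commas as empty CSV columns is an assumption based on the sample rows shown earlier. Swapping it in would change the features and therefore the results reported later in this notebook.

```python
def parse_sms_line(line):
    """Split a raw line into [label, message], keeping commas inside the message."""
    # Split only on the first comma: everything after it is message text.
    sms_type, _, message = line.partition(",")
    label = 0.0 if sms_type == "ham" else 1.0
    # Assumption: trailing commas are empty CSV columns, not message content.
    return [label, message.rstrip(",")]

sample = ("ham,Go until jurong point, crazy.. Available only in bugis n great "
          "world la e buffet... Cine there got amore wat...,,,,,,,,,")
print(parse_sms_line(sample))
```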

In [9]:
spamRDD2 = spamRDD.map(TransformToVector)
spamDF = spark.createDataFrame(spamRDD2, ["label", "message"])
spamDF.cache()
spamDF.select("label", "message").show()

+-----+--------------------+
|label|             message|
+-----+--------------------+
|  0.0|Go until jurong p...|
|  0.0|Ok lar... Joking ...|
|  0.0|U dun say so earl...|
|  0.0|Nah I don't think...|
|  0.0|Even my brother i...|
|  0.0|As per your reque...|
|  0.0|I'm gonna be home...|
|  0.0|I've been searchi...|
|  0.0|I HAVE A DATE ON ...|
|  0.0|Oh k...i'm watchi...|
|  0.0|Eh u remember how...|
|  0.0|Fine if thats th...|
|  0.0|Is that seriously...|
|  0.0|I‘m going to try ...|
|  0.0|So ü pay first la...|
|  0.0|Aft i finish my l...|
|  0.0|Ffffffffff. Alrig...|
|  0.0|Just forced mysel...|
|  0.0|Lol your always s...|
|  0.0|Did you catch the...|
+-----+--------------------+
only showing top 20 rows



## Machine Learning

In [10]:
# Training and test data
(dados_treino, dados_teste) = spamDF.randomSplit([0.7, 0.3])

In [11]:
dados_treino.count()

699

In [12]:
dados_teste.count()

301
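
The counts 699 and 301 are close to, but not exactly, a 70/30 split of the 1,000 rows, because rows are assigned to the two sets at random. The pure-Python sketch below illustrates that idea with an independent random draw per row; the function `random_split` is hypothetical and is not Spark's implementation.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))
```

In Spark, `randomSplit` also accepts an optional `seed` argument, which makes the training and test sets repeatable between runs.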

In [13]:
# Tokenization into words and TF-IDF
tokenizer = Tokenizer(inputCol = "message", outputCol = "words")
hashingTF = HashingTF(inputCol = tokenizer.getOutputCol(), outputCol = "tempfeatures")
idf = IDF(inputCol = hashingTF.getOutputCol(), outputCol = "features")
nbClassifier = NaiveBayes()
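
The three feature stages above can be sketched in plain Python to show what each one contributes. This is an illustration under stated assumptions, not Spark's code: `zlib.crc32` stands in for Spark's own hash function, `NUM_FEATURES` is kept tiny on purpose, and the three example messages are made up. The IDF weight uses the smoothed form log((m + 1) / (df + 1)) documented for Spark's `IDF`.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 32  # tiny on purpose; a real feature space would be far larger

def tokenize(message):
    """Lowercase the text and split it on whitespace."""
    return message.lower().split()

def hashing_tf(words, num_features=NUM_FEATURES):
    """Map each word to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for word in words:
        # crc32 is a stand-in hash chosen for this sketch.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

def fit_idf(tf_vectors):
    """Compute log((m + 1) / (df + 1)) per bucket, where m is the document count."""
    m = len(tf_vectors)
    doc_freq = Counter()
    for vec in tf_vectors:
        doc_freq.update(vec.keys())  # each document counts once per bucket
    return {i: math.log((m + 1) / (df + 1)) for i, df in doc_freq.items()}

def tf_idf(vec, idf):
    """Scale each term frequency by its bucket's inverse document frequency."""
    return {i: tf * idf[i] for i, tf in vec.items()}

docs = ["Free prize call now", "call me later", "free call"]
tf_vectors = [hashing_tf(tokenize(d)) for d in docs]
idf = fit_idf(tf_vectors)
features = [tf_idf(vec, idf) for vec in tf_vectors]

call_bucket = zlib.crc32(b"call") % NUM_FEATURES
print(idf[call_bucket])  # a bucket present in every document gets weight 0.0
```

A word that occurs in every message carries no information about the class, which is why its weight drops to zero.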

In [14]:
# Create the Pipeline
pipeline = Pipeline(stages = [tokenizer, hashingTF, idf, nbClassifier])

In [15]:
# Fit the model with the Pipeline
modelo = pipeline.fit(dados_treino)

In [17]:
# Predictions on the test data
previsoes = modelo.transform(dados_teste)
previsoes.select("prediction", "label").collect()

[Row(prediction=0.0, label=0.0),
 Row(prediction=1.0, label=0.0),
 Row(prediction=1.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=1.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=1.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0),
 Row(predi

In [18]:
# Evaluate accuracy
avaliador = MulticlassClassificationEvaluator(predictionCol = "prediction", labelCol = "label", metricName = "accuracy")
avaliador.evaluate(previsoes)

0.9003322259136213

In [19]:
# Summarize the predictions - confusion matrix
previsoes.groupBy("label","prediction").count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  1.0|       1.0|  123|
|  0.0|       1.0|   19|
|  1.0|       0.0|   11|
|  0.0|       0.0|  148|
+-----+----------+-----+
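
The accuracy reported earlier can be recomputed from these four counts, along with precision and recall for the spam class (label 1.0). The sketch below copies the counts from the table; the variable names are chosen for illustration.

```python
# Counts copied from the table above, keyed by (label, prediction).
counts = {
    (1.0, 1.0): 123,  # spam predicted as spam (true positives)
    (0.0, 1.0): 19,   # ham predicted as spam (false positives)
    (1.0, 0.0): 11,   # spam predicted as ham (false negatives)
    (0.0, 0.0): 148,  # ham predicted as ham (true negatives)
}

tp = counts[(1.0, 1.0)]
fp = counts[(0.0, 1.0)]
fn = counts[(1.0, 0.0)]
tn = counts[(0.0, 0.0)]
total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # share of all messages classified correctly
precision = tp / (tp + fp)     # share of predicted spam that really is spam
recall = tp / (tp + fn)        # share of actual spam that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# prints: accuracy=0.9003 precision=0.8662 recall=0.9179
```

For a spam filter, the 19 ham messages flagged as spam are usually the costlier error, so precision is worth tracking alongside accuracy.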



# End

### Thank you - Data Science Academy - <a href=http://facebook.com/dsacademy>facebook.com/dsacademybr</a>