%md
# EML4 - Lecture 05 - Massive Data Processing
### Data Preparation and Machine Learning in Spark


Main links:
* [Spark docs](https://spark.apache.org/docs/latest/)
* [Compressed datasets](https://drive.google.com/file/d/1MdZGO9quJVxuq5wcBafT25kA2UR4C1uG/view?usp=sharing)

In [1]:
# Imports and Spark session initialization for notebooks running outside Databricks

#import findspark, pyspark
#from pyspark.sql import SparkSession
#findspark.init()
#spark = SparkSession.builder.getOrCreate()

# Check that the Spark session is working
spark

### Preparation

Loading the data set. The file [crime.csv](https://www.kaggle.com/ankkur13/boston-crime-data) must be uploaded to the cluster and its path set in the cell below.

This data set is in CSV (comma-separated values) format and includes a header row with the column names. Spark also reads other formats and sources, such as Parquet, ORC, JDBC, and LIBSVM.

In [2]:
dataset_location = 'crime.csv'

dataset = spark.read.format('csv') \
               .option('inferSchema', True) \
               .option('header', True) \
               .option('sep', ',') \
               .load(dataset_location)

                                                                                

In [3]:
# Alternatively: dataset.show()
display(dataset)

DataFrame[INCIDENT_NUMBER: string, OFFENSE_CODE: int, OFFENSE_CODE_GROUP: string, OFFENSE_DESCRIPTION: string, DISTRICT: string, REPORTING_AREA: string, SHOOTING: string, OCCURRED_ON_DATE: timestamp, YEAR: int, MONTH: int, DAY_OF_WEEK: string, HOUR: int, UCR_PART: string, STREET: string, Lat: double, Long: double, Location: string]

In [4]:
dataset.printSchema()

root
 |-- INCIDENT_NUMBER: string (nullable = true)
 |-- OFFENSE_CODE: integer (nullable = true)
 |-- OFFENSE_CODE_GROUP: string (nullable = true)
 |-- OFFENSE_DESCRIPTION: string (nullable = true)
 |-- DISTRICT: string (nullable = true)
 |-- REPORTING_AREA: string (nullable = true)
 |-- SHOOTING: string (nullable = true)
 |-- OCCURRED_ON_DATE: timestamp (nullable = true)
 |-- YEAR: integer (nullable = true)
 |-- MONTH: integer (nullable = true)
 |-- DAY_OF_WEEK: string (nullable = true)
 |-- HOUR: integer (nullable = true)
 |-- UCR_PART: string (nullable = true)
 |-- STREET: string (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Long: double (nullable = true)
 |-- Location: string (nullable = true)



### Transformers and Estimators

**Example 1**: Tokenizer is a transformer for data manipulation. It splits a string into an array of tokens that can then be processed individually.
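
As a rough mental model of what this transformer does, the sketch below lowercases a string and splits it on whitespace in plain Python. The function name `simple_tokenize` is hypothetical, and this only approximates Spark's `Tokenizer` (edge cases such as repeated spaces may differ).

```python
def simple_tokenize(text):
    """Lowercase the text and split it on whitespace (approximation of Tokenizer)."""
    return text.lower().split()


# A few category names taken from the OFFENSE_CODE_GROUP column
groups = ["Larceny", "Auto Theft Recovery", "Firearm Discovery"]
tokenized = {group: simple_tokenize(group) for group in groups}

for group, tokens in tokenized.items():
    print(group, "->", tokens)
```

A transformer like this needs no `fit` step because it learns nothing from the data; it applies the same rule to every row.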

In [5]:
from pyspark.ml.feature import Tokenizer

tkn = Tokenizer(inputCol='OFFENSE_CODE_GROUP', outputCol='OFFENSE_CODE_TOKENIZED')
df = dataset.select('OFFENSE_CODE_GROUP').distinct()
tkn.transform(df).show()


+--------------------+----------------------+
|  OFFENSE_CODE_GROUP|OFFENSE_CODE_TOKENIZED|
+--------------------+----------------------+
|             Larceny|             [larceny]|
| Auto Theft Recovery|  [auto, theft, rec...|
|   Firearm Discovery|  [firearm, discovery]|
|Recovered Stolen ...|  [recovered, stole...|
|License Plate Rel...|  [license, plate, ...|
|   License Violation|  [license, violation]|
|Motor Vehicle Acc...|  [motor, vehicle, ...|
|    Liquor Violation|   [liquor, violation]|
|Assembly or Gathe...|  [assembly, or, ga...|
|      Property Found|     [property, found]|
|      Simple Assault|     [simple, assault]|
|     Warrant Arrests|    [warrant, arrests]|
|Prisoner Related ...|  [prisoner, relate...|
|      Drug Violation|     [drug, violation]|
|             Robbery|             [robbery]|
|        Embezzlement|        [embezzlement]|
|Missing Person Lo...|  [missing, person,...|
|Investigate Property|  [investigate, pro...|
|  Firearm Violations|  [firearm, 

                                                                                

**Example 2**: StandardScaler is an estimator that rescales continuous values to zero mean and/or unit standard deviation.

Let's standardize the location data, filtering out null and inconsistent values. The example below uses latitude.

StandardScaler expects a numeric vector as input, so we first use the VectorAssembler transformer to pack the data into a vector column.
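
The arithmetic behind standardization can be sketched without Spark. The helper `standardize` below is hypothetical; it subtracts the mean and divides by the sample standard deviation (the `n - 1` form, which the Spark documentation describes as the one StandardScaler uses). The four latitudes are copied from the data set only as sample input.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, divides by n - 1
    return [(v - mu) / sigma for v in values]


lats = [42.26260773, 42.35211146, 42.30812619, 42.35945371]
z = standardize(lats)

print("mean after scaling:", mean(z))
print("std after scaling:", stdev(z))
```

Because the mean and standard deviation must be learned from the data first, StandardScaler is an estimator: `fit` computes the statistics and the resulting model applies them in `transform`.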

In [6]:
from pyspark.ml.feature import StandardScaler, VectorAssembler

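# Keep only valid latitudes (non-null and positive)
df = dataset.select('Lat').filter('Lat is not null').filter('Lat > 0')
# StandardScaler needs a vector column, so wrap Lat in a one-element vector
local_assembler = VectorAssembler(inputCols=["Lat"], outputCol="VetLat")
dfv = local_assembler.transform(df)
# withMean centers the values on 0; withStd scales them to unit standard deviation
std_scaler = StandardScaler(inputCol='VetLat', outputCol='LatStd', withStd=True, withMean=True)
std_scaler.fit(dfv).transform(dfv).show(truncate=False)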

                                                                                

+-----------+-------------+----------------------+
|Lat        |VetLat       |LatStd                |
+-----------+-------------+----------------------+
|42.26260773|[42.26260773]|[-1.8734208647489472] |
|42.35211146|[42.35211146]|[0.935988368931435]   |
|42.30812619|[42.30812619]|[-0.44465379742207745]|
|42.35945371|[42.35945371]|[1.1664523389354866]  |
|42.37525782|[42.37525782]|[1.6625234251808525]  |
|42.29919694|[42.29919694]|[-0.7249316964531362] |
|42.32073413|[42.32073413]|[-0.04890645757743463]|
|42.33380683|[42.33380683]|[0.3614291126735248]  |
|42.25614494|[42.25614494]|[-2.076279694222733]  |
|42.348866  |[42.348866]  |[0.8341174715783722]  |
|42.34432328|[42.34432328]|[0.6915272184790474]  |
|42.32324363|[42.32324363]|[0.029863583312779347]|
|42.26059891|[42.26059891]|[-1.9364751917228349] |
|42.27986526|[42.27986526]|[-1.3317287572522039] |
|42.27791927|[42.27791927]|[-1.3928109297455482] |
|42.31596119|[42.31596119]|[-0.198723026202991]  |
|42.28076737|[42.28076737]|[-1.

#### Exercise 1
Another estimator, similar to StandardScaler, is MinMaxScaler, which rescales values to fit between a minimum and a maximum bound. It is most commonly used to normalize values to the interval [0,1].

Study the characteristics of the Longitude attribute, select valid latitude and longitude values, and apply the MinMaxScaler estimator.
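
Before applying the estimator, it may help to see the rescaling formula on its own. The plain-Python helper `min_max_scale` below is a hypothetical sketch, not Spark code. The constant-column branch follows the rule given in the Spark documentation (map to the midpoint of the target range), which is stated here as an assumption.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale, so use the midpoint of the target range
        return [0.5 * (new_min + new_max) for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


simple = min_max_scale([10, 20, 30])

# Sample longitudes copied from the data set
longs = [-71.12118637, -71.13531147, -71.07692974]
scaled_longs = min_max_scale(longs)
print(simple, scaled_longs)
```

Note that a single extreme value (for example an invalid coordinate) stretches the range and squeezes every other value together, which is why the exercise asks for valid coordinates only.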

In [16]:
# Exercise 01: write your solution here

from pyspark.ml.feature import MinMaxScaler

df = dataset.select('Lat', 'Long').filter('Lat is not null').filter('Lat > 0').filter('Long is not null').filter('Long < -1')
local_assembler = VectorAssembler(inputCols=["Lat","Long"], outputCol="Local")
dfv = local_assembler.transform(df)
mms = MinMaxScaler(inputCol='Local', outputCol='LocalStd', max=1, min=0)
mms.fit(dfv).transform(dfv).show(truncate=False)

+-----------+------------+--------------------------+-----------------------------------------+
|Lat        |Long        |Local                     |LocalStd                                 |
+-----------+------------+--------------------------+-----------------------------------------+
|42.26260773|-71.12118637|[42.26260773,-71.12118637]|[0.1856653098710857,0.2673862497925831]  |
|42.35211146|-71.13531147|[42.35211146,-71.13531147]|[0.7360230336323171,0.20168738604231215] |
|42.30812619|-71.07692974|[42.30812619,-71.07692974]|[0.46555795830839225,0.47323330959508814]|
|42.35945371|-71.05964817|[42.35945371,-71.05964817]|[0.7811704704741356,0.553613590996304]   |
|42.37525782|-71.02466343|[42.37525782,-71.02466343]|[0.8783498171413008,0.7163351056474138]  |
|42.29919694|-71.06046974|[42.29919694,-71.06046974]|[0.4106520710912284,0.5497922930592086]  |
|42.32073413|-71.05676415|[42.32073413,-71.05676415]|[0.5430840810712674,0.5670277853760105]  |
|42.33380683|-71.10377843|[42.33380683,-

#### Imputer

**Example 3**: Handling missing values with Imputer, replacing them with the column mean
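
Mean imputation itself is simple enough to sketch in plain Python. The helper `impute_mean` is hypothetical and uses `None` to stand in for a null value; it is an illustration of the idea, not of Spark's implementation.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


filled = impute_mean([1.0, None, 3.0])
print(filled)
```

Only true nulls are treated as missing here. Any placeholder value that is not null still counts toward the mean, so inconsistent values are worth filtering or converting before imputing.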

In [17]:
from pyspark.ml.feature import Imputer

df = dataset.select("Lat","Long")
imput = Imputer(inputCols=["Lat","Long"], outputCols = ["NeWLat","NeWLong"])
modelo = imput.fit(df)
df = modelo.transform(df)
df.filter('Lat is null').show() 

+----+----+-----------------+---------------+
| Lat|Long|           NeWLat|        NeWLong|
+----+----+-----------------+---------------+
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.90603030527|
|NULL|NULL|42.21299505769182|-70.9

#### RFormula

**Example 4**: Using the RFormula estimator

In [18]:
from pyspark.ml.feature import RFormula

df = dataset\
  .select('OFFENSE_CODE_GROUP', 'YEAR', 'MONTH', 'HOUR', 'DISTRICT')\
  .where('DISTRICT is not null')
rf = RFormula(formula = 'OFFENSE_CODE_GROUP ~ . -DISTRICT')
rf.fit(df).transform(df).show()

+--------------------+----+-----+----+--------+------------------+-----+
|  OFFENSE_CODE_GROUP|YEAR|MONTH|HOUR|DISTRICT|          features|label|
+--------------------+----+-----+----+--------+------------------+-----+
|  Disorderly Conduct|2018|   10|  20|     E18|[2018.0,10.0,20.0]| 26.0|
|       Property Lost|2018|    8|  20|     D14| [2018.0,8.0,20.0]| 12.0|
|               Other|2018|   10|  19|      B2|[2018.0,10.0,19.0]|  4.0|
|  Aggravated Assault|2018|   10|  20|      A1|[2018.0,10.0,20.0]| 14.0|
|            Aircraft|2018|   10|  20|      A7|[2018.0,10.0,20.0]| 57.0|
|           Vandalism|2018|   10|  20|     C11|[2018.0,10.0,20.0]|  7.0|
|     Verbal Disputes|2018|   10|  19|      B2|[2018.0,10.0,19.0]|  8.0|
|      Simple Assault|2018|   10|  19|     E18|[2018.0,10.0,19.0]|  6.0|
|               Towed|2018|   10|  20|      D4|[2018.0,10.0,20.0]|  9.0|
|Motor Vehicle Acc...|2018|   10|  19|     D14|[2018.0,10.0,19.0]|  0.0|
|          Auto Theft|2018|   10|  20|     E13|[201

**Example 5**: Encoding categorical attributes
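
As far as the Spark documentation describes it, RFormula indexes each string feature by frequency and then one-hot encodes it, leaving out the column for the last category. The sketch below imitates that idea in plain Python; the helper names `fit_index` and `one_hot_drop_last` are hypothetical, and the alphabetical tie-break is an assumption.

```python
from collections import Counter


def fit_index(categories):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(categories)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}


def one_hot_drop_last(category, index):
    """One-hot encode a category, omitting the column of the last-indexed category."""
    vec = [0.0] * (len(index) - 1)
    i = index[category]
    if i < len(vec):
        vec[i] = 1.0
    return vec


days = ["Wednesday", "Wednesday", "Thursday", "Tuesday", "Wednesday", "Thursday"]
hours = [20, 20, 19, 20, 19, 20]

day_index = fit_index(days)
# Each feature row is the one-hot day followed by the numeric hour
toy_features = [one_hot_drop_last(d, day_index) + [float(h)] for d, h in zip(days, hours)]
print(day_index)
print(toy_features)
```

With seven weekdays, this scheme would give six indicator columns plus `HOUR`, which is consistent with the 7-element feature vectors printed by the cell below.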

In [19]:
df = dataset\
  .select('OFFENSE_CODE_GROUP', 'DAY_OF_WEEK', 'HOUR')
rf = RFormula(formula = 'OFFENSE_CODE_GROUP ~ DAY_OF_WEEK + HOUR')
rf.fit(df).transform(df).show()

+--------------------+-----------+----+--------------------+-----+
|  OFFENSE_CODE_GROUP|DAY_OF_WEEK|HOUR|            features|label|
+--------------------+-----------+----+--------------------+-----+
|  Disorderly Conduct|  Wednesday|  20|(7,[1,6],[1.0,20.0])| 26.0|
|       Property Lost|   Thursday|  20|(7,[2,6],[1.0,20.0])| 12.0|
|               Other|  Wednesday|  19|(7,[1,6],[1.0,19.0])|  4.0|
|  Aggravated Assault|  Wednesday|  20|(7,[1,6],[1.0,20.0])| 14.0|
|            Aircraft|  Wednesday|  20|(7,[1,6],[1.0,20.0])| 57.0|
|           Vandalism|    Tuesday|  20|(7,[3,6],[1.0,20.0])|  7.0|
|Motor Vehicle Acc...|  Wednesday|  20|(7,[1,6],[1.0,20.0])|  0.0|
|     Verbal Disputes|  Wednesday|  19|(7,[1,6],[1.0,19.0])|  8.0|
|      Simple Assault|  Wednesday|  19|(7,[1,6],[1.0,19.0])|  6.0|
|               Towed|  Wednesday|  20|(7,[1,6],[1.0,20.0])|  9.0|
|Motor Vehicle Acc...|  Wednesday|  19|(7,[1,6],[1.0,19.0])|  0.0|
|          Auto Theft|     Monday|  20|(7,[4,6],[1.0,20.0])| 1

#### Exercise 2

Using RFormula, prepare a DataFrame for a classification task in which the label is `OFFENSE_CODE_GROUP` and the attributes are `DISTRICT`, `DAY_OF_WEEK`, and `HOUR`.

Null values must be handled.

In [21]:
# Exercise 02: write your solution here

df = dataset\
    .select('OFFENSE_CODE_GROUP', 'DISTRICT', 'DAY_OF_WEEK', 'HOUR')\
    .where('DISTRICT is not null')

rf = RFormula(formula = 'OFFENSE_CODE_GROUP ~ .')
rf.fit(df).transform(df).show()

+--------------------+--------+-----------+----+--------------------+-----+
|  OFFENSE_CODE_GROUP|DISTRICT|DAY_OF_WEEK|HOUR|            features|label|
+--------------------+--------+-----------+----+--------------------+-----+
|  Disorderly Conduct|     E18|  Wednesday|  20|(18,[8,12,17],[1....| 26.0|
|       Property Lost|     D14|   Thursday|  20|(18,[6,13,17],[1....| 12.0|
|               Other|      B2|  Wednesday|  19|(18,[0,12,17],[1....|  4.0|
|  Aggravated Assault|      A1|  Wednesday|  20|(18,[3,12,17],[1....| 14.0|
|            Aircraft|      A7|  Wednesday|  20|(18,[9,12,17],[1....| 57.0|
|           Vandalism|     C11|    Tuesday|  20|(18,[1,14,17],[1....|  7.0|
|     Verbal Disputes|      B2|  Wednesday|  19|(18,[0,12,17],[1....|  8.0|
|      Simple Assault|     E18|  Wednesday|  19|(18,[8,12,17],[1....|  6.0|
|               Towed|      D4|  Wednesday|  20|(18,[2,12,17],[1....|  9.0|
|Motor Vehicle Acc...|     D14|  Wednesday|  19|(18,[6,12,17],[1....|  0.0|
|          A

# Classification

Preparation

In [22]:
dataset_location = 'covtype.data'

dataset = spark.read.format('csv') \
               .option('inferSchema', True) \
               .option('header', False) \
               .option('sep', ',') \
               .load(dataset_location) \
               .withColumnRenamed('_c54', 'class')

rf = RFormula(formula = 'class ~ .')
bInput = rf.fit(dataset) \
           .transform(dataset) \
           .select('features', 'label')

bInput.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(54,[0,1,2,3,5,6,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  2.0|
|(54,[0,1,2,3,4,5,...|  2.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  2.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  2.0|
|(54,[0,1,2,3,4,5,...|  2.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
|(54,[0,2,3,4,5,6,...|  5.0|
|(54,[0,1,2,3,4,5,...|  5.0|
+--------------------+-----+
only showing top 20 rows





### LogisticRegression

In [23]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()
model = lr.fit(bInput)


In [24]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must beequal wi

In [25]:
print("accuracy: ", model.summary.accuracy)
print("precision by label: ", model.summary.precisionByLabel)
print("recall by label: ", model.summary.recallByLabel)


accuracy:  0.7243241103453973
precision by label:  [0.7117775658990879, 0.7479221545633475, 0.6741283124128312, 0.6035367940673132, 0.2007042253521127, 0.491006988200252, 0.7245859245120374]
recall by label:  [0.6977530211480363, 0.8007702055411029, 0.811126027857023, 0.38514743356388786, 0.006004424312651428, 0.2467898888696954, 0.5737688932228181]
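
These figures come from `model.summary`, which reports on the data the model was fitted on. To make the three metrics concrete, the sketch below computes them from plain Python lists. The helper `per_label_metrics` and the toy labels are hypothetical and are not taken from the covtype data.

```python
def per_label_metrics(labels, predictions):
    """Return overall accuracy plus precision and recall for each class."""
    pairs = list(zip(labels, predictions))
    classes = sorted(set(labels) | set(predictions))
    accuracy = sum(l == p for l, p in pairs) / len(pairs)
    precision, recall = {}, {}
    for c in classes:
        true_pos = sum(l == c and p == c for l, p in pairs)
        predicted = sum(p == c for _, p in pairs)  # rows predicted as c
        actual = sum(l == c for l, _ in pairs)     # rows whose true label is c
        precision[c] = true_pos / predicted if predicted else 0.0
        recall[c] = true_pos / actual if actual else 0.0
    return accuracy, precision, recall


toy_labels = [1, 1, 2, 2, 2, 3]
toy_predictions = [1, 2, 2, 2, 1, 3]
toy_accuracy, toy_precision, toy_recall = per_label_metrics(toy_labels, toy_predictions)
print(toy_accuracy, toy_precision, toy_recall)
```

Per-label values matter because accuracy alone can hide weak classes: in the output above, the fifth recall entry is about 0.006 even though overall accuracy is about 0.72.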


                                                                                

In [26]:
test = bInput.sample(fraction=0.0001)
model.transform(test).show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(54,[0,1,2,3,4,5,...|  2.0|[-9.0090680686743...|[1.81999741313856...|       2.0|
|(54,[0,1,2,3,4,5,...|  3.0|[-9.0101641039548...|[1.55158307801299...|       3.0|
|(54,[0,1,2,3,4,5,...|  2.0|[-9.0085683805230...|[4.48359267947652...|       2.0|
|(54,[0,1,2,3,4,5,...|  2.0|[-9.0081713512667...|[2.51740347208869...|       2.0|
|(54,[0,1,2,3,4,5,...|  1.0|[-9.0087747208431...|[1.71659936908560...|       2.0|
|(54,[0,1,2,3,4,5,...|  1.0|[-9.0085582628801...|[3.12977099388803...|       1.0|
|(54,[0,1,2,3,4,5,...|  2.0|[-9.0087378161775...|[2.58193308253375...|       2.0|
|(54,[0,1,2,3,4,5,...|  2.0|[-9.0085232861942...|[1.84470268749794...|       2.0|
|(54,[0,1,2,3,4,5,...|  2.0|[-9.0080473656732...|[1.91954954671642...|       2.0|
|(54,[0,1,2,3,4,

                                                                                

### DecisionTree

In [27]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, test = bInput.randomSplit([0.7, 0.3])

dt = DecisionTreeClassifier()
model = dt.fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", 
    predictionCol="prediction", 
    metricName="accuracy")

accuracy = evaluator.evaluate(predictions)
print("accuracy:", accuracy)



accuracy: 0.6976323199816102


                                                                                

In [28]:
print(model.toDebugString)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_a32b50314ca7, depth=5, numNodes=47, numClasses=8, numFeatures=54
  If (feature 0 <= 3034.5)
   If (feature 0 <= 2565.5)
    If (feature 10 <= 0.5)
     If (feature 0 <= 2456.5)
      If (feature 3 <= 15.0)
       Predict: 4.0
      Else (feature 3 > 15.0)
       Predict: 3.0
     Else (feature 0 > 2456.5)
      If (feature 17 <= 0.5)
       Predict: 2.0
      Else (feature 17 > 0.5)
       Predict: 3.0
    Else (feature 10 > 0.5)
     If (feature 22 <= 0.5)
      Predict: 2.0
     Else (feature 22 > 0.5)
      If (feature 5 <= 1010.0)
       Predict: 2.0
      Else (feature 5 > 1010.0)
       Predict: 1.0
   Else (feature 0 > 2565.5)
    If (feature 15 <= 0.5)
     If (feature 0 <= 2939.5)
      If (feature 17 <= 0.5)
       Predict: 2.0
      Else (feature 17 > 0.5)
       Predict: 3.0
     Else (feature 0 > 2939.5)
      Predict: 2.0
    Else (feature 15 > 0.5)
     If (feature 9 <= 1365.5)
      If (feature 7 <= 214.5)
    

#### Exercise 3

* Load the `Crimes in Boston` dataset
* Preprocess the data for a classification task. The `OFFENSE_CODE_GROUP` column is the class, and the `DISTRICT`, `DAY_OF_WEEK`, `HOUR`, and `SHOOTING` columns are the attributes.
 * Records without `DISTRICT` information must be excluded from the analysis
 * Records without `SHOOTING` information should take the value `N` (hint: use `expr` and `coalesce`)
* Build a Logistic Regression model
* Evaluate the model using the accuracy metric
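
The two null-handling rules can be illustrated on plain Python dictionaries before writing them in Spark. The function `coalesce` below is a hypothetical stand-in for the SQL function of the same name (first non-null argument), and the three rows are made up for the example.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


rows = [
    {"DISTRICT": "B2", "SHOOTING": None},
    {"DISTRICT": None, "SHOOTING": "Y"},
    {"DISTRICT": "A1", "SHOOTING": "Y"},
]

# Drop rows without DISTRICT, then default a missing SHOOTING to "N"
cleaned = [
    {**row, "SHOOTING": coalesce(row["SHOOTING"], "N")}
    for row in rows
    if row["DISTRICT"] is not None
]
print(cleaned)
```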

In [32]:
# Exercise 03: write your solution here
from pyspark.sql.functions import expr

dataset_location = 'crime.csv'

dataset = spark.read.format('csv') \
               .option('inferSchema', True) \
               .option('header', True) \
               .option('sep', ',') \
               .load(dataset_location)


df = dataset \
    .select('OFFENSE_CODE_GROUP', 'DISTRICT', 'DAY_OF_WEEK', 'HOUR', expr("coalesce(SHOOTING, 'N') as SHOOTING"))\
    .where('DISTRICT is not null')


rf = RFormula(formula = 'OFFENSE_CODE_GROUP ~ .')

bInput = rf.fit(df) \
           .transform(df) \
           .select('features', 'label')

bInput.show()

lr = LogisticRegression()
model = lr.fit(bInput)
print("accuracy: ", model.summary.accuracy)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(19,[8,12,17,18],...| 26.0|
|(19,[6,13,17,18],...| 12.0|
|(19,[0,12,17,18],...|  4.0|
|(19,[3,12,17,18],...| 14.0|
|(19,[9,12,17,18],...| 57.0|
|(19,[1,14,17,18],...|  7.0|
|(19,[0,12,17,18],...|  8.0|
|(19,[8,12,17,18],...|  6.0|
|(19,[2,12,17,18],...|  9.0|
|(19,[6,12,17,18],...|  0.0|
|(19,[7,15,17,18],...| 19.0|
|(19,[8,12,17,18],...|  2.0|
|(19,[4,12,17,18],...|  3.0|
|(19,[4,12,17,18],...|  2.0|
|(19,[1,12,17,18],...|  4.0|
|(19,[0,12,17,18],...|  4.0|
|(19,[2,12,17,18],...|  3.0|
|(19,[1,12,17,18],...| 27.0|
|(19,[5,14,17,18],...|  0.0|
|(19,[2,12,17,18],...|  1.0|
+--------------------+-----+
only showing top 20 rows



                                                                                

accuracy:  0.1333768854701484


# Clustering

Preparation

In [33]:
dataset_location = 'covtype.data'

dataset = spark.read.format('csv') \
               .option('inferSchema', True) \
               .option('header', False) \
               .option('sep', ',') \
               .load(dataset_location) \
               .withColumnRenamed('_c54', 'class')

rf = RFormula(formula = ' ~ . -class')
bInput = rf.fit(dataset) \
           .transform(dataset) \
           .select('features', 'class')

bInput.show()

+--------------------+-----+
|            features|class|
+--------------------+-----+
|(54,[0,1,2,3,5,6,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    2|
|(54,[0,1,2,3,4,5,...|    2|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    2|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    2|
|(54,[0,1,2,3,4,5,...|    2|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
|(54,[0,2,3,4,5,6,...|    5|
|(54,[0,1,2,3,4,5,...|    5|
+--------------------+-----+
only showing top 20 rows



#### k-means

In [34]:
from pyspark.ml.clustering import KMeans

kmeans = KMeans().setK(7)
model = kmeans.fit(bInput)

model.clusterCenters()

                                                                                

[array([2.75299212e+03, 1.62645112e+02, 1.68385153e+01, 2.42078791e+02,
        5.14256915e+01, 8.64830428e+02, 2.07967543e+02, 2.17810744e+02,
        1.39436955e+02, 9.92690362e+02, 2.12917776e-01, 8.72231793e-02,
        4.68266249e-01, 2.31592796e-01, 1.89882537e-02, 3.65105717e-02,
        2.99577134e-02, 4.42349256e-02, 1.00046985e-02, 4.11902897e-02,
        0.00000000e+00, 0.00000000e+00, 7.18559123e-03, 1.50490211e-01,
        3.82458888e-02, 2.20516836e-02, 3.53390760e-02, 3.75254503e-03,
        1.87940486e-05, 5.20595145e-03, 9.82928739e-03, 1.06499608e-04,
        2.14878622e-03, 8.54502741e-03, 2.16131558e-03, 5.11699295e-02,
        9.03617854e-02, 3.83837118e-02, 0.00000000e+00, 0.00000000e+00,
        1.25293657e-03, 5.02427565e-03, 7.75567737e-02, 4.57509789e-02,
        4.07329679e-02, 8.11965544e-02, 7.51699295e-02, 1.39702428e-03,
        8.01879405e-04, 0.00000000e+00, 0.00000000e+00, 7.86217698e-03,
        8.85199687e-03, 8.51996868e-03]),
 array([3.03076806e+03

In [35]:
predictions = model.transform(bInput)
predictions.show()

+--------------------+-----+----------+
|            features|class|prediction|
+--------------------+-----+----------+
|(54,[0,1,2,3,5,6,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    2|         3|
|(54,[0,1,2,3,4,5,...|    2|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    2|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    2|         1|
|(54,[0,1,2,3,4,5,...|    2|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
|(54,[0,2,3,4,5,6,...|    5|         3|
|(54,[0,1,2,3,4,5,...|    5|         3|
+--------------------+-----+----------+
only showing top 20 rows



In [36]:
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)

print("silhouette: ", silhouette)



silhouette:  0.5070710973839342
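
To interpret this number, it helps to see how a silhouette score is built. The sketch below is a plain-Python version for small data, using squared Euclidean distance (which, to my knowledge, is the default distance measure of `ClusteringEvaluator`). The function names and the four one-dimensional points are hypothetical, and the code assumes at least two clusters.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels, dist=sq_dist):
    """Mean silhouette over all points; assumes at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # The sum includes dist(point, point) == 0, so divide by len(own) - 1
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # Mean distance to the nearest other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters that mix distant points
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.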


                                                                                

#### Exercise 4

* Load the `Crimes in Boston` dataset
* Preprocess the data for a clustering task. The `Lat` (latitude) and `Long` (longitude) columns are the attributes.
 * Remove or impute null or inconsistent values for both attributes
* Build a clustering model with the k-means algorithm
* Evaluate the model using the silhouette index

In [41]:
# Exercise 04: write your solution here

dataset_location = 'crime.csv'

dataset = spark.read.format('csv') \
               .option('inferSchema', True) \
               .option('header', True) \
               .option('sep', ',') \
               .load(dataset_location)

df = dataset \
    .select('Lat', 'Long') \
    .where('Lat is not null and Lat > 0') \
    .where('Long is not null and Long < -1')

rf = RFormula(formula = ' ~ .')

bInput = rf.fit(df).transform(df)
bInput.show()

kmeans = KMeans().setK(5)
model = kmeans.fit(bInput)
predictions = model.transform(bInput)

evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)

print("silhouette: ", silhouette)

+-----------+------------+--------------------+
|        Lat|        Long|            features|
+-----------+------------+--------------------+
|42.26260773|-71.12118637|[42.26260773,-71....|
|42.35211146|-71.13531147|[42.35211146,-71....|
|42.30812619|-71.07692974|[42.30812619,-71....|
|42.35945371|-71.05964817|[42.35945371,-71....|
|42.37525782|-71.02466343|[42.37525782,-71....|
|42.29919694|-71.06046974|[42.29919694,-71....|
|42.32073413|-71.05676415|[42.32073413,-71....|
|42.33380683|-71.10377843|[42.33380683,-71....|
|42.25614494|-71.12802506|[42.25614494,-71....|
|  42.348866|-71.08936284|[42.348866,-71.08...|
|42.34432328|-71.15778368|[42.34432328,-71....|
|42.32324363|-71.10892316|[42.32324363,-71....|
|42.26059891| -71.1030614|[42.26059891,-71....|
|42.27986526|-71.08798275|[42.27986526,-71....|
|42.27791927| -71.0964061|[42.27791927,-71....|
|42.31596119|-71.09042564|[42.31596119,-71....|
|42.28076737|-71.04736497|[42.28076737,-71....|
|42.31277782|-71.07562922|[42.31277782,-

# Pipeline

All instantiated objects (models and transformations) can be run together in a machine learning *pipeline*. This makes models easier to apply and reuse.
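
The chaining idea can be shown with a toy pipeline in plain Python. All three classes below are hypothetical illustrations. Unlike Spark, where `Pipeline.fit` returns a separate `PipelineModel`, this sketch keeps the fitted state on the same objects for brevity.

```python
class MeanCenter:
    """Toy estimator: learns the mean in fit() and subtracts it in transform()."""

    def fit(self, data):
        self.mean_ = sum(data) / len(data)
        return self

    def transform(self, data):
        return [x - self.mean_ for x in data]


class Scale:
    """Toy transformer: fit() learns nothing; transform() multiplies by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, data):
        return self

    def transform(self, data):
        return [x * self.factor for x in data]


class ToyPipeline:
    """Runs stages in order, feeding each stage's output to the next one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


pipe = ToyPipeline([MeanCenter(), Scale(10)]).fit([1.0, 2.0, 3.0])
result = pipe.transform([2.0, 4.0])
print(result)
```

The point of the design is that everything learned from the training data (here, the mean) is reused unchanged on new data, so the test set goes through exactly the same preprocessing as the training set.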

In [42]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

iris = spark.read.csv("iris.csv", header=True, inferSchema=True, sep=",")
irisTreino, irisTeste = iris.randomSplit([0.7,0.3])

vector = VectorAssembler(inputCols=["sepal_length","sepal_width","petal_length","petal_width"],outputCol="features" )

indexer = StringIndexer(inputCol="species", outputCol="target")

mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4,5,3], featuresCol="features", labelCol="target")

pipeline = Pipeline(stages=[vector, indexer, mlp])
modelo = pipeline.fit(irisTreino)
previsao = modelo.transform(irisTeste)

previsao.select("target","features","rawprediction","probability","prediction").show(5, truncate=False)

performance = MulticlassClassificationEvaluator(labelCol="target",predictionCol="prediction", metricName="accuracy")
acuracia = performance.evaluate(previsao)
print(acuracia)


+------+-----------------+---------------------------------------------------------+---------------------------------------------------+----------+
|target|features         |rawprediction                                            |probability                                        |prediction|
+------+-----------------+---------------------------------------------------------+---------------------------------------------------+----------+
|0.0   |[4.3,3.0,1.1,0.1]|[176.44407418010744,75.68930974099888,-252.0275446345717]|[1.0,1.7488871172240973E-44,8.263048400568215E-187]|0.0       |
|0.0   |[4.4,3.2,1.3,0.2]|[176.44407418010744,75.68930974099888,-252.0275446345717]|[1.0,1.7488871172240973E-44,8.263048400568215E-187]|0.0       |
|0.0   |[4.6,3.1,1.5,0.2]|[176.44407418010744,75.68930974099888,-252.0275446345717]|[1.0,1.7488871172240973E-44,8.263048400568215E-187]|0.0       |
|0.0   |[4.8,3.0,1.4,0.1]|[176.44407418010744,75.68930974099888,-252.0275446345717]|[1.0,1.7488871172240973E-44,

# Cross-Validation and Tuning

When developing machine learning applications, you can search for a parameterization that suits the data. One way to find the best configuration is *grid search* combined with a reliable performance estimator such as cross-validation.
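
The cost of such a search is easy to underestimate, so the sketch below enumerates a parameter grid and the folds in plain Python. The helpers `build_grid` and `k_fold_indices` are hypothetical. The fold assignment here is deterministic for readability, whereas Spark assigns rows to folds randomly.

```python
from itertools import product


def build_grid(param_values):
    """Return every combination of the given parameter values as a list of dicts."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def k_fold_indices(n_rows, k):
    """Yield (train, validation) index lists; fold i validates on rows i, i+k, i+2k, ..."""
    for fold in range(k):
        validation = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, validation


toy_grid = build_grid({"maxIter": [10, 100, 1000], "layers": [[4, 4, 4, 3], [4, 6, 3]]})
toy_folds = list(k_fold_indices(n_rows=10, k=5))

# One model is trained per (parameter combination, fold) pair
n_fits = len(toy_grid) * len(toy_folds)
print(len(toy_grid), "combinations x", len(toy_folds), "folds =", n_fits, "fits")
```

With three values for `maxIter`, two layer layouts, and five folds, the search trains 30 models, and according to the Spark documentation `CrossValidator` then refits the best configuration once on the full training set.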

In [43]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

iris = spark.read.csv("iris.csv", header=True, inferSchema=True, sep=",")
irisTreino, irisTeste = iris.randomSplit([0.8,0.2])

vector = VectorAssembler(inputCols=["sepal_length","sepal_width","petal_length","petal_width"],outputCol="features" )

indexer = StringIndexer(inputCol="species", outputCol="target")

mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4,5,3], featuresCol="features", labelCol="target")

pipeline = Pipeline(stages=[vector, indexer, mlp])

performance = MulticlassClassificationEvaluator(labelCol="target", metricName="accuracy")

grid = ParamGridBuilder().addGrid(mlp.maxIter,[10,100,1000]).addGrid(mlp.layers,[[4,4,4,3],[4,6,3]]).build()

# Cross-validation has its own evaluator and splits the training data into folds for the tuning process
crossval = CrossValidator(estimator=pipeline,estimatorParamMaps=grid,evaluator=performance,numFolds=5)

# The model selected by cross-validation can then be evaluated again on a hold-out set (validating the validation)
modelo = crossval.fit(irisTreino)
previsao = modelo.transform(irisTeste)
previsao.select("rawprediction","probability","prediction").show(5, truncate=False)

performance = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
acuracia = performance.evaluate(previsao)
print(acuracia)


+---------------------------------------------------------+-----------------------------------------------------------------+----------+
|rawprediction                                            |probability                                                      |prediction|
+---------------------------------------------------------+-----------------------------------------------------------------+----------+
|[26.836414490217727,44.19286633291703,-70.93182310341365]|[2.8986029289230287E-8,0.9999999710139709,1.0045756207150348E-50]|1.0       |
|[26.836414490217727,44.19286633291703,-70.93182310341365]|[2.8986029289230287E-8,0.9999999710139709,1.0045756207150348E-50]|1.0       |
|[26.836414490217727,44.19286633291703,-70.93182310341365]|[2.8986029289230287E-8,0.9999999710139709,1.0045756207150348E-50]|1.0       |
|[26.836414490217727,44.19286633291703,-70.93182310341365]|[2.8986029289230287E-8,0.9999999710139709,1.0045756207150348E-50]|1.0       |
|[26.836414490217727,44.19286633291703,-7