# Classification models with PySpark

In this notebook we will learn more about classification models using the PySpark library. The main objective is to build a classification model that identifies which clients are likely to cancel the service.

## Preparing Data

In this section we will load the data that feeds our model and take a first look at it.

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark= SparkSession.builder.master('local[*]').getOrCreate()

spark

In [2]:
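import zipfile

# Extract the archive into the current working directory
with zipfile.ZipFile('base de dados.zip', 'r') as arquivo_zip:
    arquivo_zip.extractall()

# A forward slash avoids the invalid "\d" escape sequence in the path
dados = spark.read.csv('base de dados/dados_clientes.csv', sep=',', header=True, inferSchema=True)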

In [3]:
dados

DataFrame[id: int, Churn: string, Mais65anos: int, Conjuge: string, Dependentes: string, MesesDeContrato: int, TelefoneFixo: string, MaisDeUmaLinhaTelefonica: string, Internet: string, SegurancaOnline: string, BackupOnline: string, SeguroDispositivo: string, SuporteTecnico: string, TVaCabo: string, StreamingFilmes: string, TipoContrato: string, ContaCorreio: string, MetodoPagamento: string, MesesCobrados: double]

In [4]:
dados.limit(20).toPandas()

Unnamed: 0,id,Churn,Mais65anos,Conjuge,Dependentes,MesesDeContrato,TelefoneFixo,MaisDeUmaLinhaTelefonica,Internet,SegurancaOnline,BackupOnline,SeguroDispositivo,SuporteTecnico,TVaCabo,StreamingFilmes,TipoContrato,ContaCorreio,MetodoPagamento,MesesCobrados
0,0,Nao,0,Sim,Nao,1,Nao,SemServicoTelefonico,DSL,Nao,Sim,Nao,Nao,Nao,Nao,Mensalmente,Sim,BoletoEletronico,29.85
1,1,Nao,0,Nao,Nao,34,Sim,Nao,DSL,Sim,Nao,Sim,Nao,Nao,Nao,UmAno,Nao,Boleto,56.95
2,2,Sim,0,Nao,Nao,2,Sim,Nao,DSL,Sim,Sim,Nao,Nao,Nao,Nao,Mensalmente,Sim,Boleto,53.85
3,3,Nao,0,Nao,Nao,45,Nao,SemServicoTelefonico,DSL,Sim,Nao,Sim,Sim,Nao,Nao,UmAno,Nao,DebitoEmConta,42.3
4,4,Sim,0,Nao,Nao,2,Sim,Nao,FibraOptica,Nao,Nao,Nao,Nao,Nao,Nao,Mensalmente,Sim,BoletoEletronico,70.7
5,5,Sim,0,Nao,Nao,8,Sim,Sim,FibraOptica,Nao,Nao,Sim,Nao,Sim,Sim,Mensalmente,Sim,BoletoEletronico,99.65
6,6,Nao,0,Nao,Sim,22,Sim,Sim,FibraOptica,Nao,Sim,Nao,Nao,Sim,Nao,Mensalmente,Sim,CartaoCredito,89.1
7,7,Nao,0,Nao,Nao,10,Nao,SemServicoTelefonico,DSL,Sim,Nao,Nao,Nao,Nao,Nao,Mensalmente,Nao,Boleto,29.75
8,8,Sim,0,Sim,Nao,28,Sim,Sim,FibraOptica,Nao,Nao,Sim,Sim,Sim,Sim,Mensalmente,Sim,BoletoEletronico,104.8
9,9,Nao,0,Nao,Sim,62,Sim,Nao,DSL,Sim,Sim,Nao,Nao,Nao,Nao,UmAno,Nao,DebitoEmConta,56.15


In [4]:
dados.count()

10348

In [6]:
dados.groupBy('Churn').count().show()

+-----+-----+
|Churn|count|
+-----+-----+
|  Sim| 5174|
|  Nao| 5174|
+-----+-----+



In [7]:
dados.printSchema()

root
 |-- id: integer (nullable = true)
 |-- Churn: string (nullable = true)
 |-- Mais65anos: integer (nullable = true)
 |-- Conjuge: string (nullable = true)
 |-- Dependentes: string (nullable = true)
 |-- MesesDeContrato: integer (nullable = true)
 |-- TelefoneFixo: string (nullable = true)
 |-- MaisDeUmaLinhaTelefonica: string (nullable = true)
 |-- Internet: string (nullable = true)
 |-- SegurancaOnline: string (nullable = true)
 |-- BackupOnline: string (nullable = true)
 |-- SeguroDispositivo: string (nullable = true)
 |-- SuporteTecnico: string (nullable = true)
 |-- TVaCabo: string (nullable = true)
 |-- StreamingFilmes: string (nullable = true)
 |-- TipoContrato: string (nullable = true)
 |-- ContaCorreio: string (nullable = true)
 |-- MetodoPagamento: string (nullable = true)
 |-- MesesCobrados: double (nullable = true)



In [8]:
dados.groupBy('Internet').count().show()

+-----------+-----+
|   Internet|count|
+-----------+-----+
|FibraOptica| 5401|
|        Nao| 1741|
|        DSL| 3206|
+-----------+-----+



## Data Treatment

In this section we will treat the data, converting the columns that hold `Sim`/`Nao` (yes/no) answers into binary integers (1/0).

In [5]:
colunas_binarias=[
    'Churn',
    'Conjuge',
    'Dependentes',
    'BackupOnline',
    'TelefoneFixo',
    'MaisDeUmaLinhaTelefonica',
    'SegurancaOnline',
    'SeguroDispositivo',
    'SuporteTecnico',
    'TVaCabo',
    'StreamingFilmes',
    'ContaCorreio'
]

In [6]:
from pyspark.sql import functions as f

In [7]:
todas_colunas= [f.when(f.col(c) == 'Sim', 1).otherwise(0).alias(c) for c in colunas_binarias]
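
The `when(...).otherwise(0)` expression above maps `Sim` to 1 and every other value to 0. As a plain-Python sketch of that rule (the helper name `to_binary` is hypothetical and not part of PySpark), note that values such as `SemServicoTelefonico`, which appears in `MaisDeUmaLinhaTelefonica`, and missing values would also end up as 0:

```python
def to_binary(value):
    """Plain-Python mirror of when(col == 'Sim', 1).otherwise(0) for one value."""
    return 1 if value == 'Sim' else 0

# 'SemServicoTelefonico' and None are not 'Sim', so they fall into the otherwise branch
valores = ['Sim', 'Nao', 'SemServicoTelefonico', None]
codificados = [to_binary(v) for v in valores]
print(codificados)  # [1, 0, 0, 0]
```

If "no phone service" should be kept apart from a plain "no", that column would need a different encoding; the notebook treats both as 0.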

In [8]:
for coluna in reversed(dados.columns):
    if coluna not in colunas_binarias:
        todas_colunas.insert(0, coluna)
todas_colunas

['id',
 'Mais65anos',
 'MesesDeContrato',
 'Internet',
 'TipoContrato',
 'MetodoPagamento',
 'MesesCobrados',
 Column<'CASE WHEN (Churn = Sim) THEN 1 ELSE 0 END AS `Churn`'>,
 Column<'CASE WHEN (Conjuge = Sim) THEN 1 ELSE 0 END AS `Conjuge`'>,
 Column<'CASE WHEN (Dependentes = Sim) THEN 1 ELSE 0 END AS `Dependentes`'>,
 Column<'CASE WHEN (BackupOnline = Sim) THEN 1 ELSE 0 END AS `BackupOnline`'>,
 Column<'CASE WHEN (TelefoneFixo = Sim) THEN 1 ELSE 0 END AS `TelefoneFixo`'>,
 Column<'CASE WHEN (MaisDeUmaLinhaTelefonica = Sim) THEN 1 ELSE 0 END AS `MaisDeUmaLinhaTelefonica`'>,
 Column<'CASE WHEN (SegurancaOnline = Sim) THEN 1 ELSE 0 END AS `SegurancaOnline`'>,
 Column<'CASE WHEN (SeguroDispositivo = Sim) THEN 1 ELSE 0 END AS `SeguroDispositivo`'>,
 Column<'CASE WHEN (SuporteTecnico = Sim) THEN 1 ELSE 0 END AS `SuporteTecnico`'>,
 Column<'CASE WHEN (TVaCabo = Sim) THEN 1 ELSE 0 END AS `TVaCabo`'>,
 Column<'CASE WHEN (StreamingFilmes = Sim) THEN 1 ELSE 0 END AS `StreamingFilmes`'>,
 Column

In [9]:
dados.select(
    todas_colunas
).toPandas()

Unnamed: 0,id,Mais65anos,MesesDeContrato,Internet,TipoContrato,MetodoPagamento,MesesCobrados,Churn,Conjuge,Dependentes,BackupOnline,TelefoneFixo,MaisDeUmaLinhaTelefonica,SegurancaOnline,SeguroDispositivo,SuporteTecnico,TVaCabo,StreamingFilmes,ContaCorreio
0,0,0,1,DSL,Mensalmente,BoletoEletronico,29.850000,0,1,0,1,0,0,0,0,0,0,0,1
1,1,0,34,DSL,UmAno,Boleto,56.950000,0,0,0,0,1,0,1,1,0,0,0,0
2,2,0,2,DSL,Mensalmente,Boleto,53.850000,1,0,0,1,1,0,1,0,0,0,0,1
3,3,0,45,DSL,UmAno,DebitoEmConta,42.300000,0,0,0,0,0,0,1,1,1,0,0,0
4,4,0,2,FibraOptica,Mensalmente,BoletoEletronico,70.700000,1,0,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10343,10343,0,4,FibraOptica,Mensalmente,BoletoEletronico,86.687604,1,1,0,0,1,1,0,0,0,1,0,1
10344,10344,1,13,FibraOptica,Mensalmente,BoletoEletronico,86.195233,1,0,0,0,1,1,0,0,0,1,0,1
10345,10345,0,15,FibraOptica,Mensalmente,Boleto,75.099071,1,0,0,0,1,1,0,0,0,0,0,1
10346,10346,0,17,FibraOptica,Mensalmente,CartaoCredito,87.824082,1,0,0,0,1,1,0,0,0,0,1,1


In [10]:
dataset= dados.select(todas_colunas)

In [11]:
dataset.printSchema()

root
 |-- id: integer (nullable = true)
 |-- Mais65anos: integer (nullable = true)
 |-- MesesDeContrato: integer (nullable = true)
 |-- Internet: string (nullable = true)
 |-- TipoContrato: string (nullable = true)
 |-- MetodoPagamento: string (nullable = true)
 |-- MesesCobrados: double (nullable = true)
 |-- Churn: integer (nullable = false)
 |-- Conjuge: integer (nullable = false)
 |-- Dependentes: integer (nullable = false)
 |-- BackupOnline: integer (nullable = false)
 |-- TelefoneFixo: integer (nullable = false)
 |-- MaisDeUmaLinhaTelefonica: integer (nullable = false)
 |-- SegurancaOnline: integer (nullable = false)
 |-- SeguroDispositivo: integer (nullable = false)
 |-- SuporteTecnico: integer (nullable = false)
 |-- TVaCabo: integer (nullable = false)
 |-- StreamingFilmes: integer (nullable = false)
 |-- ContaCorreio: integer (nullable = false)



## Dummy Variables

In this section we will create dummy variables for the columns that still have a string type.
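
A dummy (one-hot) variable turns one categorical column into one 0/1 column per category. A minimal plain-Python sketch of the idea, using three hypothetical rows rather than the real dataset:

```python
# Hypothetical rows; only the 'Internet' categories are taken from the data above
linhas = [
    {'id': 0, 'Internet': 'DSL'},
    {'id': 1, 'Internet': 'FibraOptica'},
    {'id': 2, 'Internet': 'Nao'},
]

# One output column per distinct category
categorias = sorted({linha['Internet'] for linha in linhas})

dummies = [
    {'id': linha['id'],
     **{f'Internet_{c}': int(linha['Internet'] == c) for c in categorias}}
    for linha in linhas
]

for d in dummies:
    print(d)
```

Each row gets exactly one 1 among the `Internet_*` columns, which is the same shape the pivot-based cells in this section produce.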

In [12]:
dados.select(['Internet', 'TipoContrato', 'MetodoPagamento']).show()

+-----------+------------+----------------+
|   Internet|TipoContrato| MetodoPagamento|
+-----------+------------+----------------+
|        DSL| Mensalmente|BoletoEletronico|
|        DSL|       UmAno|          Boleto|
|        DSL| Mensalmente|          Boleto|
|        DSL|       UmAno|   DebitoEmConta|
|FibraOptica| Mensalmente|BoletoEletronico|
|FibraOptica| Mensalmente|BoletoEletronico|
|FibraOptica| Mensalmente|   CartaoCredito|
|        DSL| Mensalmente|          Boleto|
|FibraOptica| Mensalmente|BoletoEletronico|
|        DSL|       UmAno|   DebitoEmConta|
|        DSL| Mensalmente|          Boleto|
|        Nao|    DoisAnos|   CartaoCredito|
|FibraOptica|       UmAno|   CartaoCredito|
|FibraOptica| Mensalmente|   DebitoEmConta|
|FibraOptica| Mensalmente|BoletoEletronico|
|FibraOptica|    DoisAnos|   CartaoCredito|
|        Nao|       UmAno|          Boleto|
|FibraOptica|    DoisAnos|   DebitoEmConta|
|        DSL| Mensalmente|   CartaoCredito|
|FibraOptica| Mensalmente|Boleto

In [13]:
dataset.groupBy('id').pivot('Internet').agg(f.lit(1)).na.fill(0).show()

+----+---+-----------+---+
|  id|DSL|FibraOptica|Nao|
+----+---+-----------+---+
|7982|  1|          0|  0|
|9465|  0|          1|  0|
|2122|  1|          0|  0|
|3997|  1|          0|  0|
|6654|  0|          1|  0|
|7880|  0|          1|  0|
|4519|  0|          1|  0|
|6466|  0|          1|  0|
| 496|  1|          0|  0|
|7833|  0|          1|  0|
|1591|  0|          0|  1|
|2866|  0|          1|  0|
|8592|  0|          1|  0|
|1829|  0|          1|  0|
| 463|  0|          1|  0|
|4900|  0|          1|  0|
|4818|  0|          1|  0|
|7554|  1|          0|  0|
|1342|  0|          0|  1|
|5300|  0|          1|  0|
+----+---+-----------+---+
only showing top 20 rows



In [14]:
Internet = dataset.groupBy('id').pivot('Internet').agg(f.lit(1)).na.fill(0)
TipoContrato= dataset.groupBy('id').pivot('TipoContrato').agg(f.lit(1)).na.fill(0)
MetodoPagamento= dataset.groupBy('id').pivot('MetodoPagamento').agg(f.lit(1)).na.fill(0)

In [15]:
dataset\
    .join(Internet, 'id', how= 'inner')\
    .join(TipoContrato, 'id', how= 'inner')\
    .join(MetodoPagamento, 'id', how= 'inner')\
    .select(
        '*',
        f.col('DSL').alias('Internet_DSL'),
        f.col('FibraOptica').alias('Internet_FibraOptica'),
        f.col('Nao').alias('Internet_Nao'),
        f.col('Mensalmente').alias('TipoContrato_Mensalmente'),
        f.col('UmAno').alias('TipoContrato_UmAno'),
        f.col('DoisAnos').alias('TipoContrato_DoisAnos'),
        f.col('DebitoEmConta').alias('MetodoPagamento_DebitoEmConta'),
        f.col('CartaoCredito').alias('MetodoPagamento_CartaoCredito'),
        f.col('BoletoEletronico').alias('MetodoPagamento_BoletoEletronico'),
        f.col('Boleto').alias('MetodoPagamento_Boleto')
    )\
    .drop(
        'Internet', 'TipoContrato', 'MetodoPagamento',
        'DSL', 'FibraOptica', 'Nao',
        'Mensalmente', 'UmAno', 'DoisAnos',
        'DebitoEmConta', 'CartaoCredito', 'BoletoEletronico', 'Boleto'
    )\
    .limit(5).toPandas()

Unnamed: 0,id,Mais65anos,MesesDeContrato,MesesCobrados,Churn,Conjuge,Dependentes,BackupOnline,TelefoneFixo,MaisDeUmaLinhaTelefonica,...,Internet_DSL,Internet_FibraOptica,Internet_Nao,TipoContrato_Mensalmente,TipoContrato_UmAno,TipoContrato_DoisAnos,MetodoPagamento_DebitoEmConta,MetodoPagamento_CartaoCredito,MetodoPagamento_BoletoEletronico,MetodoPagamento_Boleto
0,0,0,1,29.85,0,1,0,1,0,0,...,1,0,0,1,0,0,0,0,1,0
1,1,0,34,56.95,0,0,0,0,1,0,...,1,0,0,0,1,0,0,0,0,1
2,2,0,2,53.85,1,0,0,1,1,0,...,1,0,0,1,0,0,0,0,0,1
3,3,0,45,42.3,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,0,0
4,4,0,2,70.7,1,0,0,0,1,0,...,0,1,0,1,0,0,0,0,1,0


In [16]:
dataset = dataset\
    .join(Internet, 'id', how= 'inner')\
    .join(TipoContrato, 'id', how= 'inner')\
    .join(MetodoPagamento, 'id', how= 'inner')\
    .select(
        '*',
        f.col('DSL').alias('Internet_DSL'),
        f.col('FibraOptica').alias('Internet_FibraOptica'),
        f.col('Nao').alias('Internet_Nao'),
        f.col('Mensalmente').alias('TipoContrato_Mensalmente'),
        f.col('UmAno').alias('TipoContrato_UmAno'),
        f.col('DoisAnos').alias('TipoContrato_DoisAnos'),
        f.col('DebitoEmConta').alias('MetodoPagamento_DebitoEmConta'),
        f.col('CartaoCredito').alias('MetodoPagamento_CartaoCredito'),
        f.col('BoletoEletronico').alias('MetodoPagamento_BoletoEletronico'),
        f.col('Boleto').alias('MetodoPagamento_Boleto')
    )\
    .drop(
        'Internet', 'TipoContrato', 'MetodoPagamento',
        'DSL', 'FibraOptica', 'Nao',
        'Mensalmente', 'UmAno', 'DoisAnos',
        'DebitoEmConta', 'CartaoCredito', 'BoletoEletronico', 'Boleto'
    )

In [None]:
dataset.show()

+---+----------+---------------+-------------+-----+-------+-----------+------------+------------+------------------------+---------------+-----------------+--------------+-------+---------------+------------+------------+--------------------+------------+------------------------+------------------+---------------------+-----------------------------+-----------------------------+--------------------------------+----------------------+
| id|Mais65anos|MesesDeContrato|MesesCobrados|Churn|Conjuge|Dependentes|BackupOnline|TelefoneFixo|MaisDeUmaLinhaTelefonica|SegurancaOnline|SeguroDispositivo|SuporteTecnico|TVaCabo|StreamingFilmes|ContaCorreio|Internet_DSL|Internet_FibraOptica|Internet_Nao|TipoContrato_Mensalmente|TipoContrato_UmAno|TipoContrato_DoisAnos|MetodoPagamento_DebitoEmConta|MetodoPagamento_CartaoCredito|MetodoPagamento_BoletoEletronico|MetodoPagamento_Boleto|
+---+----------+---------------+-------------+-----+-------+-----------+------------+------------+------------------------

## Preparing Data for Logistic Regression

In this section we will prepare the data for a Logistic Regression model that classifies whether clients will cancel their contracts.

In [None]:
dataset.toPandas()

Unnamed: 0,id,Mais65anos,MesesDeContrato,MesesCobrados,Churn,Conjuge,Dependentes,BackupOnline,TelefoneFixo,MaisDeUmaLinhaTelefonica,...,Internet_DSL,Internet_FibraOptica,Internet_Nao,TipoContrato_Mensalmente,TipoContrato_UmAno,TipoContrato_DoisAnos,MetodoPagamento_DebitoEmConta,MetodoPagamento_CartaoCredito,MetodoPagamento_BoletoEletronico,MetodoPagamento_Boleto
0,0,0,1,29.850000,0,1,0,1,0,0,...,1,0,0,1,0,0,0,0,1,0
1,1,0,34,56.950000,0,0,0,0,1,0,...,1,0,0,0,1,0,0,0,0,1
2,2,0,2,53.850000,1,0,0,1,1,0,...,1,0,0,1,0,0,0,0,0,1
3,3,0,45,42.300000,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,0,0
4,4,0,2,70.700000,1,0,0,0,1,0,...,0,1,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10343,10343,0,4,86.687604,1,1,0,0,1,1,...,0,1,0,1,0,0,0,0,1,0
10344,10344,1,13,86.195233,1,0,0,0,1,1,...,0,1,0,1,0,0,0,0,1,0
10345,10345,0,15,75.099071,1,0,0,0,1,1,...,0,1,0,1,0,0,0,0,0,1
10346,10346,0,17,87.824082,1,0,0,0,1,1,...,0,1,0,1,0,0,0,1,0,0


In [17]:
from pyspark.ml.feature import VectorAssembler

In [18]:
dataset = dataset.withColumnRenamed('Churn', 'label')

X = dataset.columns
X.remove('label')
X.remove('id')

In [19]:
assembler = VectorAssembler(inputCols=X, outputCol='features')

In [21]:
dataset_prep = assembler.transform(dataset).select('features', 'label')
dataset_prep.limit(5).show()
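
`VectorAssembler` packs the 24 feature columns into a single vector column. In the outputs further down, these vectors print as `(24,[...],[...])`, which is Spark's sparse notation: the vector size, the indices of the non-zero entries, and their values. A plain-Python sketch of that notation, using a short hypothetical vector (the helper `para_esparso` is illustrative only):

```python
def para_esparso(denso):
    """Return (size, indices, values) for the non-zero entries of a dense list."""
    indices = [i for i, v in enumerate(denso) if v != 0]
    valores = [denso[i] for i in indices]
    return (len(denso), indices, valores)

exemplo = [0.0, 34.0, 56.95, 0.0, 0.0, 1.0]  # hypothetical 6-feature row
print(para_esparso(exemplo))  # (6, [1, 2, 5], [34.0, 56.95, 1.0])
```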

## Tweaking and Prediction


In this section we will split the data into training and test sets, fit the model, and predict the labels for the test data.

In [23]:
SEED = 101
treino, teste = dataset_prep.randomSplit([0.7, 0.3], seed= SEED)

print(treino.count())
print(teste.count())
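
The two counts are close to, but usually not exactly, 70% and 30% of the rows. The sketch below assumes that `randomSplit` assigns each row to a split by an independent random draw, which would explain this; it is a plain-Python illustration of that assumption (the function `divisao_aleatoria` is hypothetical), not Spark's actual implementation, and it will not reproduce Spark's exact split:

```python
import random

def divisao_aleatoria(linhas, pesos, seed):
    """Assign each row to train or test with an independent seeded draw."""
    limite = pesos[0] / sum(pesos)  # probability of landing in the first split
    rng = random.Random(seed)
    treino_ex, teste_ex = [], []
    for linha in linhas:
        (treino_ex if rng.random() < limite else teste_ex).append(linha)
    return treino_ex, teste_ex

treino_ex, teste_ex = divisao_aleatoria(list(range(10348)), [0.7, 0.3], seed=101)
print(len(treino_ex), len(teste_ex))
```

Fixing the seed makes the split repeatable, which is why the notebook defines `SEED` once and reuses it.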

In [24]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()

In [25]:
modelo_lr = lr.fit(treino)

In [26]:
previsoes_lr_teste = modelo_lr.transform(teste)

In [27]:
previsoes_lr_teste.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    1|[-0.3519619473682...|[0.41290673494697...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[0.95365030045450...|[0.72184869007248...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[0.14450666004979...|[0.53606392907100...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[0.87702149023861...|[0.70620462069356...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[2.78266768053301...|[0.94173200027892...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[1.12733470127584...|[0.75534669231844...|       0.0|
|(24,[0,1,2,3,4,6,...|    0|[3.27995976958612...|[0.96373487760009...|       0.0|
|(24,[0,1,2,3,4,6,...|    0|[-0.0315613304946...|[0.49211032228603...|       1.0|
|(24,[0,1,2,3,4,6,...|    0|[0.21299545255668...|[0.55304846021607...|       0.0|
|(24,[0,1,2,3,4,
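
For binary logistic regression, the `probability` column appears to be the logistic (sigmoid) function applied to `rawPrediction`, and the predicted label is the class whose probability exceeds 0.5. A small sketch that checks this against the first row of the output above (the digits are truncated in the printed table, so the match is approximate):

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

margem_classe_0 = -0.3519619473682  # first entry of rawPrediction, first row
prob_classe_0 = sigmoid(margem_classe_0)
prob_classe_1 = 1.0 - prob_classe_0

# Assumes the default decision threshold of 0.5
previsao = 1.0 if prob_classe_1 > 0.5 else 0.0
print(round(prob_classe_0, 4), previsao)  # 0.4129 1.0
```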

## Metrics

In this section we will explore the model's metrics, first on the training data (through the model summary) and then on the test predictions.

In [28]:
resumo_lr_treino = modelo_lr.summary

In [29]:
print(f'Acurácia: {resumo_lr_treino.accuracy}')
print(f'Precisão: {resumo_lr_treino.precisionByLabel[1]}')
print(f'Recall: {resumo_lr_treino.recallByLabel[1]}')
print(f'F1: {resumo_lr_treino.fMeasureByLabel()[1]}')

Acurácia: 0.7873993893977241
Precisão: 0.7727866283624968
Recall: 0.8171775752554543
F1: 0.7943624161073827


In [30]:
tp = previsoes_lr_teste.select('label', 'prediction').where((f.column('label')== 1) & (f.column('prediction')==1)).count()
tn = previsoes_lr_teste.select('label', 'prediction').where((f.column('label')== 0) & (f.column('prediction')==0)).count()
fp = previsoes_lr_teste.select('label', 'prediction').where((f.column('label')== 0) & (f.column('prediction')==1)).count()
fn = previsoes_lr_teste.select('label', 'prediction').where((f.column('label')== 1) & (f.column('prediction')==0)).count()

print(f'True positives: {tp}\nTrue Negative: {tn}\nFalse Positives: {fp}\nFalse Negatives: {fn}')

True positives: 1269
True Negative: 1153
False Positives: 436
False Negatives: 284
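
Accuracy, precision, recall and F1 for the churn class follow directly from these four counts. The sketch below uses the standard definitions with the test-set counts printed above; the results agree with the Logistic Regression test metrics reported in the Comparing Metrics section:

```python
tp, tn, fp, fn = 1269, 1153, 436, 284  # test-set counts from the cell above

acuracia = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precisao = tp / (tp + fp)                   # share of predicted churners that really churned
recall = tp / (tp + fn)                     # share of real churners that were found
f1 = 2 * precisao * recall / (precisao + recall)

print(f'Acurácia: {acuracia:.4f}')  # Acurácia: 0.7708
print(f'Precisão: {precisao:.4f}')  # Precisão: 0.7443
print(f'Recall: {recall:.4f}')      # Recall: 0.8171
print(f'F1: {f1:.4f}')              # F1: 0.7790
```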


In [31]:
def calcula_mostra_matriz_confusao(df_transform_modelo, normalize=False, percentage=True):
  tp = df_transform_modelo.select('label', 'prediction').where((f.col('label') == 1) & (f.col('prediction') == 1)).count()
  tn = df_transform_modelo.select('label', 'prediction').where((f.col('label') == 0) & (f.col('prediction') == 0)).count()
  fp = df_transform_modelo.select('label', 'prediction').where((f.col('label') == 0) & (f.col('prediction') == 1)).count()
  fn = df_transform_modelo.select('label', 'prediction').where((f.col('label') == 1) & (f.col('prediction') == 0)).count()

  valorP = 1
  valorN = 1

  if normalize:
    valorP = tp + fn
    valorN = fp + tn

  if percentage and normalize:
    valorP = valorP / 100
    valorN = valorN / 100

  print(' '*20, 'Previsto')
  print(' '*15, 'Churn', ' '*5 ,'Não-Churn')
  print(' '*4, 'Churn', ' '*6, int(tp/valorP), ' '*7, int(fn/valorP))
  print('Real')
  print(' '*4, 'Não-Churn', ' '*2, int(fp/valorN), ' '*7, int(tn/valorN))

In [32]:
calcula_mostra_matriz_confusao(previsoes_lr_teste)

                     Previsto
                Churn       Não-Churn
     Churn        1269         284
Real
     Não-Churn    436         1153
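
The call above uses the default `normalize=False`, so it prints raw counts. As a sketch of what row normalization computes, each count can be divided by the total of its real class; the helper `percentuais_linha` is hypothetical and rounds to two decimals, whereas the notebook's function truncates with `int`:

```python
tp, fn = 1269, 284   # row of real Churn
fp, tn = 436, 1153   # row of real Não-Churn

def percentuais_linha(*contagens):
    """Express each count as a percentage of its row total."""
    total = sum(contagens)
    return [round(100 * c / total, 2) for c in contagens]

print(percentuais_linha(tp, fn))  # [81.71, 18.29]
print(percentuais_linha(fp, tn))  # [27.44, 72.56]
```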


## Decision Tree

In this section we will learn about and implement a Decision Tree.

In [33]:
from pyspark.ml.classification import DecisionTreeClassifier

In [34]:
dtc = DecisionTreeClassifier(seed=SEED)
modelo_dtc = dtc.fit(treino)

In [35]:
previsoes_dtc_treino = modelo_dtc.transform(treino)
previsoes_dtc_treino.show()

+--------------------+-----+--------------+--------------------+----------+
|            features|label| rawPrediction|         probability|prediction|
+--------------------+-----+--------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    0| [491.0,216.0]|[0.69448373408769...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[1472.0,113.0]|[0.92870662460567...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|    [17.0,1.0]|[0.94444444444444...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|    [17.0,1.0]|[0.94444444444444...|       0.0|
|(24,[0,1,2,3,4,5,...|    1|[299.0,1795.0]|[0.14278892072588...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[299.0,1795.0]|[0.14278892072588...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[299.0,1795.0]|[0.14278892072588...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|  [124.0,89.0]|[0.58215962441314...|       0.0|
|(24,[0,1,2,3,4,5,...|    0| [491.0,216.0]|[0.69448373408769...|       0.0|
|(24,[0,1,2,3,4,5,...|    1| [224.0,218.0]|[0.50678733031674...|       0.0|
|(24,[0,1,2,
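
In the output above, `rawPrediction` looks like the number of training rows of each class in the leaf that a row falls into, and `probability` is that pair normalized to sum to 1. A short sketch under that reading, checked against two rows of the table (the helper name is hypothetical):

```python
def probabilidades_da_folha(contagens):
    """Normalize per-class leaf counts so that they sum to 1."""
    total = sum(contagens)
    return [c / total for c in contagens]

print(probabilidades_da_folha([491.0, 216.0]))
print(probabilidades_da_folha([299.0, 1795.0]))
```

This also explains why several rows share identical values: rows that reach the same leaf receive the same counts and the same probabilities.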

## Decision Tree Metrics

In this section we will compute metrics for the Decision Tree algorithm.

In [36]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator()

In [37]:
evaluator.evaluate(previsoes_dtc_treino, {evaluator.metricName: 'accuracy'})

0.7932278656674993

In [38]:
acuracia_dtc_treino = evaluator.evaluate(previsoes_dtc_treino, {evaluator.metricName: 'accuracy'})
precisao_dtc_treino = evaluator.evaluate(previsoes_dtc_treino, {evaluator.metricName: 'precisionByLabel', evaluator.metricLabel: 1})
recall_dtc_treino = evaluator.evaluate(previsoes_dtc_treino, {evaluator.metricName: 'recallByLabel', evaluator.metricLabel: 1})
f1_dtc_treino = evaluator.evaluate(previsoes_dtc_treino, {evaluator.metricName: 'fMeasureByLabel', evaluator.metricLabel: 1})


print(f"Acurácia: {acuracia_dtc_treino}")
print(f"Precisão: {precisao_dtc_treino}")
print(f"Recall: {recall_dtc_treino}")
print(f"F1: {f1_dtc_treino}")

Acurácia: 0.7932278656674993
Precisão: 0.815516730826177
Recall: 0.7605633802816901
F1: 0.7870820234352672


In [39]:
previsoes_dtc_teste = modelo_dtc.transform(teste)

In [40]:
previsoes_dtc_teste.show()

+--------------------+-----+--------------+--------------------+----------+
|            features|label| rawPrediction|         probability|prediction|
+--------------------+-----+--------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    1|  [124.0,89.0]|[0.58215962441314...|       0.0|
|(24,[0,1,2,3,4,5,...|    0| [224.0,218.0]|[0.50678733031674...|       0.0|
|(24,[0,1,2,3,4,5,...|    0| [224.0,218.0]|[0.50678733031674...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[299.0,1795.0]|[0.14278892072588...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[1472.0,113.0]|[0.92870662460567...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[1472.0,113.0]|[0.92870662460567...|       0.0|
|(24,[0,1,2,3,4,6,...|    0|[1472.0,113.0]|[0.92870662460567...|       0.0|
|(24,[0,1,2,3,4,6,...|    0|  [61.0,120.0]|[0.33701657458563...|       1.0|
|(24,[0,1,2,3,4,6,...|    0| [224.0,218.0]|[0.50678733031674...|       0.0|
|(24,[0,1,2,3,4,6,...|    0|   [27.0,99.0]|[0.21428571428571...|       1.0|
|(24,[0,1,2,

In [41]:
evaluator.evaluate(previsoes_dtc_teste, {evaluator.metricName: 'accuracy'})

0.781031190324634

## Random Forest

In this section we will implement a Random Forest.

In [42]:
from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(seed = SEED)

In [43]:
modelo_rfc = rfc.fit(treino)

In [44]:
previsoes_rfc_treino = modelo_rfc.transform(treino)

In [45]:
previsoes_rfc_treino.limit(10).show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    0|[15.3949696381251...|[0.76974848190625...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[15.9852123098756...|[0.79926061549378...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[8.05281817278802...|[0.40264090863940...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[8.05281817278802...|[0.40264090863940...|       1.0|
|(24,[0,1,2,3,4,5,...|    1|[5.06677885389007...|[0.25333894269450...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[5.06677885389007...|[0.25333894269450...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[4.50635581804925...|[0.22531779090246...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[7.79835516739668...|[0.38991775836983...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[12.1071214967267...|[0.60535607483633...|       0.0|
|(24,[0,1,2,3,4,
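
For the random forest, the numbers above are consistent with `rawPrediction` being the sum of the per-tree class probabilities, so dividing by the number of trees gives `probability`. The sketch assumes the default of 20 trees, since `RandomForestClassifier` was created without setting `numTrees`:

```python
num_arvores = 20  # assumed default numTrees

# First entries of rawPrediction from the first and third rows above
brutos_classe_0 = [15.3949696381251, 8.05281817278802]

probs_classe_0 = [bruto / num_arvores for bruto in brutos_classe_0]
previsoes = [0.0 if p > 0.5 else 1.0 for p in probs_classe_0]

print(probs_classe_0)
print(previsoes)  # [0.0, 1.0]
```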

In [46]:
previsoes_rfc_teste = modelo_rfc.transform(teste)

previsoes_rfc_teste.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    1|[7.56359697427371...|[0.37817984871368...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[9.85539578811756...|[0.49276978940587...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[7.48794358290319...|[0.37439717914515...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[9.57290948135017...|[0.47864547406750...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[13.6683497311920...|[0.68341748655960...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[13.6727115026570...|[0.68363557513285...|       0.0|
|(24,[0,1,2,3,4,6,...|    0|[16.2180161481769...|[0.81090080740884...|       0.0|
|(24,[0,1,2,3,4,6,...|    0|[5.70683751556433...|[0.28534187577821...|       1.0|
|(24,[0,1,2,3,4,6,...|    0|[8.09620429042921...|[0.40481021452146...|       1.0|
|(24,[0,1,2,3,4,

In [47]:
evaluator.evaluate(previsoes_rfc_teste, {evaluator.metricName: 'accuracy'})

0.7705283259070655

## Comparing Metrics

In this section we will compare the metrics of the three models.

In [48]:
def compara_metricas_modelos(lista_previsoes):

  # 's' is the string that will be returned;
  # it collects the confusion matrix
  # as well as the accuracy, precision, recall and F1-score values
  s = '\n'

  for modelo, df_transform_modelo in lista_previsoes.items():

    s += '-' * 50 + '\n'  # separator line
    s += modelo + '\n'

    # the steps to build the confusion matrix are the same as in the lesson
    tp = df_transform_modelo.select('label', 'prediction').where((f.col('label') == 1) & (f.col('prediction') == 1)).count()
    tn = df_transform_modelo.select('label', 'prediction').where((f.col('label') == 0) & (f.col('prediction') == 0)).count()
    fp = df_transform_modelo.select('label', 'prediction').where((f.col('label') == 0) & (f.col('prediction') == 1)).count()
    fn = df_transform_modelo.select('label', 'prediction').where((f.col('label') == 1) & (f.col('prediction') == 0)).count()

    # build the confusion-matrix part of the string
    s += ' '*20 + 'Previsto\n'
    s += ' '*15 +  'Churn' + ' '*5 + 'Não-Churn\n'
    s += ' '*4 + 'Churn' + ' '*6 +  str(int(tp)) + ' '*7 + str(int(fn)) + '\n'
    s += 'Real\n'
    s += ' '*4 + 'Não-Churn' + ' '*2 + str(int(fp)) +  ' '*7 + str(int(tn))  + '\n'
    s += '\n'

    # append each metric's value to the return string using MulticlassClassificationEvaluator
    evaluator = MulticlassClassificationEvaluator()

    s += f'Acurácia: {evaluator.evaluate(df_transform_modelo, {evaluator.metricName: "accuracy"})}\n'
    s += f'Precisão: {evaluator.evaluate(df_transform_modelo, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1})}\n'
    s += f'Recall: {evaluator.evaluate(df_transform_modelo, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1})}\n'
    s += f'F1: {evaluator.evaluate(df_transform_modelo, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1})}\n'

  return s

In [50]:
print(compara_metricas_modelos({'Logistic Regression': previsoes_lr_teste,
                                'DecisionTreeClassifier': previsoes_dtc_teste,
                                'RandomForestClassifier': previsoes_rfc_teste}))


--------------------------------------------------
Logistic Regression
                    Previsto
               Churn     Não-Churn
    Churn      1269       284
Real
    Não-Churn  436       1153

Acurácia: 0.7708465945257797
Precisão: 0.7442815249266862
Recall: 0.8171281390856407
F1: 0.7790055248618785
--------------------------------------------------
DecisionTreeClassifier
                    Previsto
               Churn     Não-Churn
    Churn      1184       369
Real
    Não-Churn  319       1270

Acurácia: 0.781031190324634
Precisão: 0.7877578176979375
Recall: 0.7623953638119768
F1: 0.7748691099476439
--------------------------------------------------
RandomForestClassifier
                    Previsto
               Churn     Não-Churn
    Churn      1324       229
Real
    Não-Churn  492       1097

Acurácia: 0.7705283259070655
Precisão: 0.7290748898678414
Recall: 0.8525434642627173
F1: 0.7859899079845651



## Cross Validation

In this section we will explore the cross-validation technique.

In [51]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [52]:
dtc = DecisionTreeClassifier(seed=SEED)

In [53]:
grid = ParamGridBuilder()\
        .addGrid(dtc.maxDepth, [2, 5, 10])\
        .addGrid(dtc.maxBins, [10, 32, 45])\
        .build()

In [54]:
evaluator = MulticlassClassificationEvaluator()

In [55]:
dtc_cv = CrossValidator(
    estimator= dtc,
    estimatorParamMaps= grid,
    evaluator= evaluator,
    numFolds= 3,
    seed= SEED
)
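
To get a feel for the cost of this search, the sketch below counts how many models are trained: one per parameter combination per fold, plus (as far as we understand `CrossValidator`) one final refit of the best combination on the full training set. The variable names are illustrative only:

```python
from itertools import product

profundidades = [2, 5, 10]   # values tried for maxDepth
bins = [10, 32, 45]          # values tried for maxBins
num_folds = 3

combinacoes = list(product(profundidades, bins))
ajustes_cv = len(combinacoes) * num_folds  # fits during cross-validation
ajustes_totais = ajustes_cv + 1            # plus the assumed final refit

print(len(combinacoes), ajustes_cv, ajustes_totais)  # 9 27 28
```

Because the evaluator is created without arguments, the best combination is chosen by its default metric, which is F1 in the PySpark versions we are aware of, rather than by accuracy.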

In [56]:
modelo_dtc_cv = dtc_cv.fit(treino)

In [58]:
previsoes_dtc_cv_teste = modelo_dtc_cv.transform(teste)

evaluator.evaluate(previsoes_dtc_cv_teste, {evaluator.metricName: 'accuracy'})

0.7889879057924889

In [59]:
print(compara_metricas_modelos({'DecisionTreeClassifier': previsoes_dtc_teste,
                                'DecisionTreeClassifierCV': previsoes_dtc_cv_teste}))


--------------------------------------------------
DecisionTreeClassifier
                    Previsto
               Churn     Não-Churn
    Churn      1184       369
Real
    Não-Churn  319       1270

Acurácia: 0.781031190324634
Precisão: 0.7877578176979375
Recall: 0.7623953638119768
F1: 0.7748691099476439
--------------------------------------------------
DecisionTreeClassifierCV
                    Previsto
               Churn     Não-Churn
    Churn      1320       233
Real
    Não-Churn  430       1159

Acurácia: 0.7889879057924889
Precisão: 0.7542857142857143
Recall: 0.849967804249839
F1: 0.7992733878292462

