# Predicting Deputies' Votes

Predictive models built with the [scikit-learn](http://scikit-learn.org/stable/index.html) library to predict the election outcome (`situacao`) of federal deputy candidates from data on past elections.

Tutorial used as a starting point: [Regularized Linear Models](https://www.kaggle.com/apapiu/regularized-linear-models)


In [505]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import graphviz
from sklearn import preprocessing
from sklearn import tree
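# OneHotEncoder comes from category_encoders (it supports use_cat_names);
# a second import of sklearn's OneHotEncoder would only be shadowed by this one.
from category_encoders.one_hot import OneHotEncoder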
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

%matplotlib inline

In [506]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


In [507]:
train.isnull().sum()

ano                                      0
sequencial_candidato                     0
nome                                     0
uf                                       0
partido                                  0
quantidade_doacoes                       0
quantidade_doadores                      0
total_receita                            0
media_receita                            0
recursos_de_outros_candidatos.comites    0
recursos_de_pessoas_fisicas              0
recursos_de_pessoas_juridicas            0
recursos_proprios                        0
recursos_de_partido_politico             0
quantidade_despesas                      0
quantidade_fornecedores                  0
total_despesa                            0
media_despesa                            0
cargo                                    0
sexo                                     0
grau                                     0
estado_civil                             0
ocupacao                                 0
situacao                                 0
dtype: int64

In [508]:
test.isnull().sum()

ano                                      0
sequencial_candidato                     0
nome                                     0
uf                                       0
partido                                  0
quantidade_doacoes                       0
quantidade_doadores                      0
total_receita                            0
media_receita                            0
recursos_de_outros_candidatos.comites    0
recursos_de_pessoas_fisicas              0
recursos_de_pessoas_juridicas            0
recursos_proprios                        0
recursos_de_partido_politico             0
quantidade_despesas                      0
quantidade_fornecedores                  0
total_despesa                            0
media_despesa                            0
cargo                                    0
sexo                                     0
grau                                     0
estado_civil                             0
ocupacao                                 0
dtype: int64

In [509]:
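# .copy() yields independent frames, so later in-place transforms do not act on a slice of train/test.
x_train = train.loc[:, 'partido':'ocupacao'].copy()
x_test = test.loc[:, 'partido':'ocupacao'].copy()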
x_train.head(3)

Unnamed: 0,partido,quantidade_doacoes,quantidade_doadores,total_receita,media_receita,recursos_de_outros_candidatos.comites,recursos_de_pessoas_fisicas,recursos_de_pessoas_juridicas,recursos_proprios,recursos_de_partido_politico,quantidade_despesas,quantidade_fornecedores,total_despesa,media_despesa,cargo,sexo,grau,estado_civil,ocupacao
0,PT,6,6,16600.0,2766.67,0.0,9000.0,6300.0,1300.0,0.0,14,14,16583.6,1184.54,DEPUTADO FEDERAL,MASCULINO,ENSINO MÉDIO COMPLETO,CASADO(A),VEREADOR
1,PT,13,13,22826.0,1755.85,6625.0,15000.0,1000.0,201.0,0.0,24,23,20325.99,846.92,DEPUTADO FEDERAL,FEMININO,SUPERIOR COMPLETO,SOLTEIRO(A),SERVIDOR PÚBLICO ESTADUAL
2,PT,17,16,158120.8,9301.22,2250.0,34150.0,62220.8,59500.0,0.0,123,108,146011.7,1187.09,DEPUTADO FEDERAL,FEMININO,SUPERIOR COMPLETO,VIÚVO(A),PEDAGOGO


### 1. Are the classes imbalanced (that is, does one class have many more instances than the other)? In what proportion? What side effects can class imbalance have on the classifier? How could you handle it? (10 pts.)

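The classes are imbalanced: there are far more non-elected candidates than elected ones (roughly 86.5% versus 13.5% of the training set). With this skew, a classifier can reach high accuracy by predicting `nao_eleito` almost every time, so accuracy alone overstates performance and recall on the `eleito` class tends to suffer.

The imbalance was checked using the `situacao` column.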

In [510]:
train['situacao'].value_counts()

nao_eleito    6596
eleito        1026
Name: situacao, dtype: int64

Percentage of non-elected candidates

In [511]:
nao_eleito = 6596/len(train) * 100
print(nao_eleito)

86.53896615061663


Percentage of elected candidates

In [512]:
eleitos = 100 - nao_eleito
print(eleitos)

13.46103384938337
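
One common way to handle this kind of imbalance is to weight the classes inversely to their frequency. The sketch below is not part of the original notebook; it rebuilds a label vector from the counts printed above and uses scikit-learn's `compute_class_weight` to show what the `"balanced"` setting would assign. The names `y` and `class_weight` are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the counts shown above: 6596 not elected (0), 1026 elected (1).
y = np.array([0] * 6596 + [1] * 1026)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the minority class receives the larger weight.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
print(class_weight)
```

A dictionary like this (or simply `class_weight="balanced"`) can be passed to `LogisticRegression`, `DecisionTreeClassifier` and `RandomForestClassifier`. `AdaBoostClassifier` and `GradientBoostingClassifier` have no such parameter, but their `fit` methods accept `sample_weight`.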


In [513]:
y_train = (train['situacao'] == 'eleito').astype('int64')
y_train

0       0
1       0
2       1
3       0
4       1
       ..
7617    0
7618    0
7619    0
7620    0
7621    0
Name: situacao, Length: 7622, dtype: int64

In [514]:
numeric_features = x_train.dtypes[x_train.dtypes != "object"].index
x_train[numeric_features][:5]

Unnamed: 0,quantidade_doacoes,quantidade_doadores,total_receita,media_receita,recursos_de_outros_candidatos.comites,recursos_de_pessoas_fisicas,recursos_de_pessoas_juridicas,recursos_proprios,recursos_de_partido_politico,quantidade_despesas,quantidade_fornecedores,total_despesa,media_despesa
0,6,6,16600.0,2766.67,0.0,9000.0,6300.0,1300.0,0.0,14,14,16583.6,1184.54
1,13,13,22826.0,1755.85,6625.0,15000.0,1000.0,201.0,0.0,24,23,20325.99,846.92
2,17,16,158120.8,9301.22,2250.0,34150.0,62220.8,59500.0,0.0,123,108,146011.7,1187.09
3,6,6,3001.12,500.19,0.0,1150.0,1101.12,750.0,0.0,8,8,3001.12,375.14
4,48,48,119820.0,2496.25,0.0,50878.0,0.0,68942.0,0.0,133,120,116416.64,875.31


In [515]:
x_train[numeric_features] = np.log1p(x_train[numeric_features])
x_test[numeric_features] = np.log1p(x_test[numeric_features])
x_train.head(3)

Unnamed: 0,partido,quantidade_doacoes,quantidade_doadores,total_receita,media_receita,recursos_de_outros_candidatos.comites,recursos_de_pessoas_fisicas,recursos_de_pessoas_juridicas,recursos_proprios,recursos_de_partido_politico,quantidade_despesas,quantidade_fornecedores,total_despesa,media_despesa,cargo,sexo,grau,estado_civil,ocupacao
0,PT,1.94591,1.94591,9.717218,7.925761,0.0,9.105091,8.748464,7.170888,0.0,2.70805,2.70805,9.71623,7.077954,DEPUTADO FEDERAL,MASCULINO,ENSINO MÉDIO COMPLETO,CASADO(A),VEREADOR
1,PT,2.639057,2.639057,10.035699,7.471278,8.798757,9.615872,6.908755,5.308268,0.0,3.218876,3.178054,9.919705,6.742786,DEPUTADO FEDERAL,FEMININO,SUPERIOR COMPLETO,SOLTEIRO(A),SERVIDOR PÚBLICO ESTADUAL
2,PT,2.890372,2.833213,11.971121,9.138008,7.71913,10.438547,11.038461,10.993748,0.0,4.820282,4.691348,11.891449,7.080102,DEPUTADO FEDERAL,FEMININO,SUPERIOR COMPLETO,VIÚVO(A),PEDAGOGO
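
As a small check of what the transform above does, the sketch below applies `np.log1p` to the three `total_receita` values from the first table, plus a zero. `log1p(x)` computes `log(1 + x)`, so zero amounts stay at zero instead of producing `-inf`, and `np.expm1` inverts it. The array names are illustrative.

```python
import numpy as np

# First three total_receita values from the table above, plus a zero amount.
total_receita = np.array([16600.0, 22826.0, 158120.8, 0.0])

logged = np.log1p(total_receita)   # log(1 + x), well defined at x = 0
restored = np.expm1(logged)        # inverse transform, exp(x) - 1

print(np.round(logged, 6))
print(np.allclose(restored, total_receita))
```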


In [516]:
categorical_features = x_train.dtypes[x_train.dtypes == "object"].index
x_train[categorical_features][:5]

Unnamed: 0,partido,cargo,sexo,grau,estado_civil,ocupacao
0,PT,DEPUTADO FEDERAL,MASCULINO,ENSINO MÉDIO COMPLETO,CASADO(A),VEREADOR
1,PT,DEPUTADO FEDERAL,FEMININO,SUPERIOR COMPLETO,SOLTEIRO(A),SERVIDOR PÚBLICO ESTADUAL
2,PT,DEPUTADO FEDERAL,FEMININO,SUPERIOR COMPLETO,VIÚVO(A),PEDAGOGO
3,PRONA,DEPUTADO FEDERAL,MASCULINO,ENSINO MÉDIO INCOMPLETO,CASADO(A),MILITAR REFORMADO
4,PT,DEPUTADO FEDERAL,MASCULINO,ENSINO FUNDAMENTAL COMPLETO,CASADO(A),DEPUTADO


In [517]:
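# Encode only the categorical columns; use_cat_names=True yields columns such as partido_PT.
onehotencode = OneHotEncoder(cols=list(categorical_features), use_cat_names=True)
# fit() returns the encoder itself. The second call refits it on x_test, so the
# categories used by transform() in the next cell are the ones learned from x_test.
onehotencode.fit(x_train)
onehotencode.fit(x_test)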

In [518]:
train_ohe = onehotencode.transform(x_train)
train_ohe_test = onehotencode.transform(x_test)
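
An alternative that keeps the encoding tied to the training data is sketched below. It swaps in `pandas.get_dummies` for the `category_encoders` encoder used in this notebook, and the two tiny frames are made-up stand-ins for `x_train` and `x_test`. The idea is to derive the dummy columns from the training set only and then align the test set to them.

```python
import pandas as pd

# Made-up stand-ins for the categorical columns of x_train and x_test.
train_cat = pd.DataFrame({"partido": ["PT", "PT", "PRONA"],
                          "sexo": ["MASCULINO", "FEMININO", "MASCULINO"]})
test_cat = pd.DataFrame({"partido": ["PSOL", "PT"],
                         "sexo": ["FEMININO", "FEMININO"]})

# Columns are defined by the training data only.
train_dummies = pd.get_dummies(train_cat)

# Align the test set to those columns: categories unseen in training (PSOL) are
# dropped, and training categories missing from the test set become all-zero columns.
test_dummies = pd.get_dummies(test_cat).reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
```

With this alignment both frames always share the same columns in the same order, which is what the classifiers further down require.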

In [519]:
train_ohe.describe()

Unnamed: 0,partido_PSOL,partido_PSB,partido_PT,partido_PTB,partido_PC do B,partido_PRB,partido_PTN,partido_PRP,partido_PDT,partido_PHS,...,ocupacao_DIRETOR DE ESTABELECIMENTO DE ENSINO,ocupacao_SERRALHEIRO,ocupacao_PROGRAMADOR DE COMPUTADOR,ocupacao_AGENCIADOR DE PROPAGANDA,ocupacao_ZOOTECNISTA,ocupacao_TELEFONISTA,ocupacao_FISCAL DE TRANSPORTE COLETIVO,"ocupacao_TÉCNICO DE MINERAÇÃO, METALURGIA E GEOLOGIA",ocupacao_QUÍMICO,ocupacao_FAXINEIRO
count,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0,...,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0,7622.0
mean,0.040278,0.067568,0.081212,0.054841,0.021385,0.016006,0.012333,0.014957,0.068355,0.029913,...,0.000656,0.000394,0.000394,0.000262,0.000131,0.000262,0.000131,0.000131,0.000525,0.000131
std,0.196624,0.251019,0.273179,0.227685,0.144675,0.125508,0.110373,0.121388,0.25237,0.17036,...,0.025606,0.019837,0.019837,0.016198,0.011454,0.016198,0.011454,0.011454,0.022904,0.011454
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### 2. Train a logistic regression model, a decision tree, an AdaBoost model, a random forest model and a gradient boosting model. Tune these models using cross-validation and control overfitting where necessary, taking the particularities of each model into account. (10 pts.)

### 3. Report precision, recall and F-measure on training and validation. Is there a large performance gap between training and validation? How do you assess the results? Justify your answer. (10 pts.)

In [523]:
# Settings for the folds
num_folds = 30
seed = 7
# Split the data into folds
kfold = KFold(num_folds, shuffle=True, random_state=seed)
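
Question 3 asks for precision, recall and F-measure on both training and validation data. One way to obtain them with a `KFold` splitter is `cross_validate` with `return_train_score=True`. The sketch below is an illustration on synthetic data from `make_classification` (an assumption, not the election data), and it uses 5 folds rather than 30 to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic, imbalanced stand-in for the encoded training set (about 13% positives).
X, y = make_classification(n_samples=1000, n_features=12,
                           weights=[0.87, 0.13], random_state=7)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
metrics = ["precision", "recall", "f1"]

scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=kfold, scoring=metrics, return_train_score=True)

# A large gap between the train and validation means points to overfitting.
for metric in metrics:
    print(metric,
          round(scores[f"train_{metric}"].mean(), 3),
          round(scores[f"test_{metric}"].mean(), 3))
```

In the returned dictionary the `test_*` entries hold the scores on the held-out folds, which play the role of validation scores here.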

# Random Forest

In [529]:
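# The target used from this cell onward is the one-hot column partido_PT,
# not the situacao label encoded earlier: test.csv has no situacao column,
# so there is no election label to score against on the test set.
x_train = train_ohe.drop('partido_PT', axis=1)
x_test = train_ohe_test.drop('partido_PT', axis=1)
y_train = train_ohe['partido_PT']
y_test = train_ohe_test['partido_PT']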
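
Because `test.csv` carries no `situacao` column, one way to evaluate predictions of the election outcome itself is to hold out part of the labelled training data. The sketch below shows a stratified split on a made-up frame with roughly the same class proportions. The frame `df` and the split size are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for train.csv: 87 non-elected and 13 elected candidates.
df = pd.DataFrame({"total_receita": range(100),
                   "situacao": ["nao_eleito"] * 87 + ["eleito"] * 13})

X = df[["total_receita"]]
y = (df["situacao"] == "eleito").astype("int64")

# stratify=y keeps the elected/non-elected ratio similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=7)
print(len(X_tr), len(X_val), y_tr.mean(), y_val.mean())
```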

In [531]:
mdl = RandomForestClassifier(n_jobs=6, n_estimators=100, random_state=22)
mdl.fit(x_train, y_train)

In [None]:
predict_ohe = mdl.predict(x_test)

## Mean absolute error of about 0.057

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, predict_ohe)

0.05727351916376307
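
For 0/1 labels each absolute error is either 0 or 1, so the mean absolute error is simply the misclassification rate, that is, one minus the accuracy. The short check below illustrates this on made-up labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up binary labels: two of the eight predictions are wrong.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])

mae = mean_absolute_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
print(mae, acc, mae + acc)
```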

In [None]:
print(classification_report(y_test, predict_ohe))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97      4268
           1       0.94      0.20      0.33       324

    accuracy                           0.94      4592
   macro avg       0.94      0.60      0.65      4592
weighted avg       0.94      0.94      0.92      4592



## Decision Tree

In [None]:
decision_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16, random_state=42)
decision_tree.fit(x_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

## Decision tree accuracy

In [None]:
predict_tree = decision_tree.predict(x_test)
accuracy = accuracy_score(y_test, predict_tree)
print('Decision tree accuracy: %.3f%%' % (accuracy * 100))

Decision tree accuracy: 92.944%


In [None]:
print(classification_report(y_test, predict_tree))

              precision    recall  f1-score   support

           0       0.93      1.00      0.96      4268
           1       0.00      0.00      0.00       324

    accuracy                           0.93      4592
   macro avg       0.46      0.50      0.48      4592
weighted avg       0.86      0.93      0.90      4592
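
Question 2 asks for the models to be tuned with cross-validation. A minimal sketch of how the tree's `max_depth` and `min_samples_leaf` could be searched with `GridSearchCV` follows. It runs on synthetic data, and the grid values, the 5 folds and the `f1` scoring are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the encoded training set.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.87, 0.13], random_state=7)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)

# Shallower trees and larger leaves both limit overfitting.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      scoring="f1", cv=kfold)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring on `f1` rather than accuracy keeps the search from rewarding a tree that predicts only the majority class.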



# AdaBoost

In [None]:
decision_tree2 = DecisionTreeClassifier(max_depth=4, random_state=42)
ada_clf = AdaBoostClassifier(decision_tree2, n_estimators=200, learning_rate=0.5)
ada_clf.fit(x_train, y_train)

# AdaBoost accuracy

In [None]:
predict_ada = ada_clf.predict(x_test)
accuracy = accuracy_score(y_test, predict_ada)
print('AdaBoost accuracy: %.3f%%' % (accuracy * 100))

AdaBoost accuracy: 96.733%


In [None]:
print(classification_report(y_test, predict_ada))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98      4268
           1       0.98      0.55      0.70       324

    accuracy                           0.97      4592
   macro avg       0.97      0.77      0.84      4592
weighted avg       0.97      0.97      0.96      4592



# Logistic Regression

### Logistic Regression with AdaBoost

In [None]:
lr_ada = LogisticRegression(random_state=42, max_iter=1000)
ada_clf2 = AdaBoostClassifier(lr_ada, n_estimators=200, learning_rate=0.5)
ada_clf2.fit(x_train, y_train)

### Logistic Regression with AdaBoost accuracy

In [None]:
predict_lr_ada = ada_clf2.predict(x_test)
accuracy = accuracy_score(y_test, predict_lr_ada)
print('Logistic Regression with AdaBoost accuracy: %.3f%%' % (accuracy * 100))

Logistic Regression with AdaBoost accuracy: 91.311%


In [None]:
print(classification_report(y_test, predict_lr_ada))

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      4268
           1       0.23      0.10      0.14       324

    accuracy                           0.91      4592
   macro avg       0.58      0.54      0.55      4592
weighted avg       0.89      0.91      0.90      4592



## Logistic Regression without AdaBoost

In [532]:
lr = LogisticRegression(random_state=42, max_iter=1000)
lr.fit(x_train, y_train)

## Logistic Regression accuracy

In [533]:
predict_lr = lr.predict(x_test)
accuracy = accuracy_score(y_test, predict_lr)
print('Logistic Regression accuracy: %.3f%%' % (accuracy * 100))

Logistic Regression accuracy: 97.147%


In [None]:
print(classification_report(y_test, predict_lr))

              precision    recall  f1-score   support

           0       0.94      0.97      0.95      4268
           1       0.28      0.14      0.18       324

    accuracy                           0.91      4592
   macro avg       0.61      0.55      0.57      4592
weighted avg       0.89      0.91      0.90      4592



## GradientBoost

In [None]:
gb_clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.5, random_state=42)
gb_clf.fit(x_train, y_train)

## GradientBoost accuracy

In [None]:
predict_gb_clf = gb_clf.predict(x_test)
accuracy = accuracy_score(y_test, predict_gb_clf)
print('GradientBoost accuracy: %.3f%%' % (accuracy * 100))

GradientBoost accuracy: 97.060%


In [None]:
print(classification_report(y_test, predict_gb_clf))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98      4268
           1       0.98      0.60      0.74       324

    accuracy                           0.97      4592
   macro avg       0.97      0.80      0.86      4592
weighted avg       0.97      0.97      0.97      4592
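
To compare the five model families side by side, their positive-class precision, recall and F1 can be collected into one table. The sketch below does this on synthetic data with a stratified hold-out. The data, the reduced `n_estimators` values and the name `summary` are assumptions, so its numbers say nothing about the results reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the encoded data.
X, y = make_classification(n_samples=800, n_features=12,
                           weights=[0.87, 0.13], random_state=7)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            stratify=y, random_state=7)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # average="binary" reports the metrics for the positive class (label 1) only.
    p, r, f, _ = precision_recall_fscore_support(
        y_val, model.predict(X_val), average="binary", zero_division=0)
    rows.append({"model": name, "precision": p, "recall": r, "f1": f})

summary = pd.DataFrame(rows).set_index("model")
print(summary.round(2))
```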

