# Veltec - Driver ranking

Author: Alexandre Marcondes

**Exercise:** Build a classification model using the Veltec or Senai dataset.

Set up evaluation tests for different classifiers, considering:

* Hyperparameter search (consider testing regularization parameters)

* Feature search

* A cross-validation method

## Explanation of the CSV file
---

This CSV file relates to the Veltec request: driver profile.

The goal of this work was to classify drivers into defensive, offensive, and economical driving profiles. Given the events recorded for a driver during a trip, such as speeding, harsh acceleration, harsh braking, and others, our algorithm should assign the driver to one of these profiles. **In this first sprint, the score obtained covers only the safety criterion**.

With that defined, the first step toward classifying drivers was to design a score calculation method. The calculation involves several equations that produce a score for each driver on different criteria. An overall score is then computed from those scores. Based on the overall score, drivers were divided into score bands (ranks).

This notebook uses two CSV files:

* vigencias_scores.csv: a vigência is a route driven by a driver, whether a long trip or a shorter one. This CSV lists every vigência with the events that occurred in it, along with the score and rank computed for each vigência.
* drivers_medias.csv: this CSV holds each driver's mean score and the resulting mean rank. It also holds the totals summed over all of that driver's vigências in the period.

The files are quite similar, but the score distribution changes once each driver's mean is taken.
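
The relationship between the two files can be sketched with a `groupby`. This is only an illustration on a made-up toy table: the exact aggregation used to build drivers_medias.csv is not shown in this notebook, so summing the events and averaging `score_geral` is an assumption based on the description above, and `toy_vigencias` and `toy_medias` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for vigencias_scores.csv (values are made up for illustration).
toy_vigencias = pd.DataFrame({
    "id_motorista": [1, 1, 2, 2, 2],
    "distancia_percorrida_km": [100.0, 50.0, 200.0, 10.0, 40.0],
    "qtd_frenagens": [2, 0, 1, 0, 3],
    "score_geral": [80.0, 100.0, 60.0, 90.0, 90.0],
})

# One row per driver: sum distances and event counts, average the score.
toy_medias = toy_vigencias.groupby("id_motorista").agg(
    distancia_percorrida_km=("distancia_percorrida_km", "sum"),
    qtd_frenagens=("qtd_frenagens", "sum"),
    score_geral=("score_geral", "mean"),
).reset_index()

print(toy_medias)
```

Averaging pulls extreme per-trip scores toward the middle, which is one plausible reason the score distribution differs between the two files.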

The PDF file "Descrição das tabelas - Veltec" explains each attribute in the table. The dataset in this notebook was taken from the table "vigencias_consolidadas.csv", so the attribute descriptions are in the "Consolidado" section of the data dictionary.

---

In [1]:
import pandas as pd

In [21]:
vigencias = pd.read_csv('vigencias_scores.csv')
medias = pd.read_csv('drivers_medias.csv')

In [22]:
print('Shape of the vigências dataframe:')
print(vigencias.shape)
print('Shape of the means dataframe:')
print(medias.shape)

Shape of the vigências dataframe:
(12642, 27)
Shape of the means dataframe:
(676, 24)


In [23]:
vigencias.head()

| | id_vei | id_uo_vei | id_motorista | id_uo_motorista | distancia_percorrida_km | qtd_banguela | qtd_curvas | qtd_aceleracoes | qtd_frenagens | qtd_vel_faixa_1 | ... | qtd_vel_via_faixa_3 | tempo_vel_via_faixa_1 | tempo_vel_via_faixa_2 | tempo_vel_via_faixa_3 | qtd_manuseio_celular | qtd_fadiga_motorista | qtd_distracao_motorista | qtd_uso_cigarro | score_geral | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66486 | 2855 | 636779.0 | 2950.0 | 234.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 208.0 | 0.0 | 1456.0 | 0.0 | 0.0 | 0.0 | 0.0 | 50.0 | ruim |
| 1 | 8155 | 3321 | 636786.0 | 2950.0 | 114.083 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 78.962685 | regular |
| 2 | 8577 | 3327 | 636592.0 | 2950.0 | 183.725 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | otimo |
| 3 | 7665 | 3581 | 636644.0 | 2950.0 | 702.525 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 97.437814 | otimo |
| 4 | 8427 | 2868 | 636989.0 | 2950.0 | 175.29 | 0.0 | 0.0 | 1.0 | 3.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 86.308403 | bom |


In [24]:
medias.head()

| | id_motorista | distancia_percorrida_km | qtd_banguela | qtd_curvas | qtd_aceleracoes | qtd_frenagens | qtd_vel_faixa_1 | qtd_vel_faixa_2 | qtd_vel_faixa_3 | tempo_vel_faixa_1 | ... | qtd_vel_via_faixa_3 | tempo_vel_via_faixa_1 | tempo_vel_via_faixa_2 | tempo_vel_via_faixa_3 | qtd_manuseio_celular | qtd_fadiga_motorista | qtd_distracao_motorista | qtd_uso_cigarro | score_geral | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 380254.0 | 715.996 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 0.0 | 1030.0 | 3848.0 | 0.0 | 0.0 | 0.0 | 0.0 | 74.987698 | regular |
| 1 | 394805.0 | 1455.529 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | otimo |
| 2 | 394806.0 | 2081.978 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99.977357 | otimo |
| 3 | 394807.0 | 3210.916 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99.974232 | otimo |
| 4 | 394808.0 | 2617.744 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | otimo |


The columns from "id_vei" through "id_uo_motorista" describe the driver and the operational unit (uo).

The columns from "distancia_percorrida_km" through "qtd_uso_cigarro" are events that occurred during one of a driver's vigências (a trip, for example).

The "score_geral" and "rank" columns were created during the first Veltec sprint. The overall score was calculated through a series of equations involving the distance traveled and the driver's events. The rank was assigned according to the following scale:

* 90 <= overall score <= 100: Excellent (`otimo`)
* 80 <= overall score < 90: Good (`bom`)
* 60 <= overall score < 80: Fair (`regular`)
* 40 <= overall score < 60: Poor (`ruim`)
* overall score < 40: Very poor (`pessimo`)
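
The scale above can be expressed as left-closed bins with `pandas.cut`. The sketch below is not the author's scoring code (which is not shown in this notebook). The helper `score_to_rank` and the sample values are illustrative, with scores taken from the `vigencias.head()` output.

```python
import pandas as pd

# Bin edges follow the scale above; each interval is closed on the left,
# e.g. [80, 90) maps to "bom". The last bin also covers a score of exactly 100.
bins = [-float("inf"), 40, 60, 80, 90, float("inf")]
labels = ["pessimo", "ruim", "regular", "bom", "otimo"]


def score_to_rank(scores):
    """Map overall scores to rank labels using left-closed bins."""
    return pd.cut(scores, bins=bins, labels=labels, right=False)


# Scores of the first five vigências shown earlier.
toy_scores = pd.Series([50.0, 78.962685, 100.0, 97.437814, 86.308403])
print(score_to_rank(toy_scores).tolist())
# ['ruim', 'regular', 'otimo', 'otimo', 'bom']
```

The printed labels match the `rank` column of those five rows.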

In [51]:
vigencias['rank'].value_counts()

1    7729
4    2369
0     992
3     960
2     592
Name: rank, dtype: int64

Note: the integer labels in this output indicate that the cell was run after the label-encoding step in the feature preparation section. `LabelEncoder` assigns codes in alphabetical order of the class names, so 0 = bom, 1 = otimo, 2 = pessimo, 3 = regular, 4 = ruim.

In [26]:
medias['rank'].value_counts()

regular    238
otimo      224
bom        143
ruim        65
pessimo      6
Name: rank, dtype: int64

**A good choice for the target column would be the "rank" column**. For example, a classifier could be trained to separate "excellent drivers" from "all other drivers".
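
A binary target of that kind could be derived as in the sketch below. It assumes the `rank` column still holds the original string labels (before any label encoding). `toy_rank` and `is_otimo` are hypothetical names used only here.

```python
import pandas as pd

# Toy rank column with the string labels found in the CSV files.
toy_rank = pd.Series(["otimo", "bom", "regular", "otimo", "ruim", "pessimo"])

# 1 for excellent drivers, 0 for everyone else.
is_otimo = (toy_rank == "otimo").astype(int)

print(is_otimo.tolist())
```

The rest of this notebook keeps the five-class target instead.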

# Pipeline
---

### Libraries
---

In [27]:
# Core data, math, and plotting libraries
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from scipy.io import arff
from scipy.special import expit

# Model persistence (sklearn.externals.joblib was removed in scikit-learn 0.23)
import joblib

# General scikit-learn utilities
from sklearn import linear_model, preprocessing
from sklearn.datasets import fetch_olivetti_faces, make_classification
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import (GridSearchCV, KFold, LeaveOneOut,
                                     RandomizedSearchCV, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Linear classifiers
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier

# KNN classifiers
from sklearn.neighbors import KNeighborsClassifier

# Naive Bayes classifiers
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Decision tree classifiers
from sklearn.tree import DecisionTreeClassifier

# SVM
from sklearn.svm import SVC

### Feature preparation
---

In [52]:
vigencias.columns

Index(['id_vei', 'id_uo_vei', 'id_motorista', 'id_uo_motorista',
       'distancia_percorrida_km', 'qtd_banguela', 'qtd_curvas',
       'qtd_aceleracoes', 'qtd_frenagens', 'qtd_vel_faixa_1',
       'qtd_vel_faixa_2', 'qtd_vel_faixa_3', 'tempo_vel_faixa_1',
       'tempo_vel_faixa_2', 'tempo_vel_faixa_3', 'qtd_vel_via_faixa_1',
       'qtd_vel_via_faixa_2', 'qtd_vel_via_faixa_3', 'tempo_vel_via_faixa_1',
       'tempo_vel_via_faixa_2', 'tempo_vel_via_faixa_3',
       'qtd_manuseio_celular', 'qtd_fadiga_motorista',
       'qtd_distracao_motorista', 'qtd_uso_cigarro', 'score_geral', 'rank'],
      dtype='object')

In [53]:
# Label-encode the "rank" variable
le = preprocessing.LabelEncoder()
vigencias['rank'] = le.fit_transform(vigencias['rank'])

# Feature columns ("score_geral" is dropped because the rank is derived from it)
X = vigencias.drop(['rank', 'score_geral'], axis=1)
# Target column
y = vigencias.loc[:, 'rank']

# Train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
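
The split above is unseeded and not stratified, so accuracies change between runs and rare ranks may be unevenly represented in the test set. The sketch below shows one possible variant on synthetic data (not the Veltec dataset). Dropping the identifier column, the `random_state` value, and all `toy` names are assumptions made for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with an imbalanced, already-encoded rank column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "id_motorista": np.arange(200),
    "qtd_frenagens": rng.poisson(2, size=200),
    "qtd_aceleracoes": rng.poisson(1, size=200),
    "rank": np.repeat([0, 1, 2, 3], [20, 120, 20, 40]),
})

# Identifier columns carry no driving behaviour, so they are left out here.
id_cols = ["id_motorista"]
X_toy = toy.drop(columns=id_cols + ["rank"])
y_toy = toy["rank"]

# stratify keeps class proportions equal in both sets; random_state fixes the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(y_te.value_counts().sort_index())
```

With stratification, each rank makes up the same share of the test set as of the full data.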

### Building the pipelines
---

In [54]:
# Build the pipelines
pipe_log = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression())])
pipe_knn = Pipeline([('scl', StandardScaler()), ('clf', KNeighborsClassifier(n_neighbors=3))])
pipe_tree = Pipeline([('scl', StandardScaler()), ('clf', DecisionTreeClassifier())])
pipe_nb = Pipeline([('scl', StandardScaler()), ('clf', GaussianNB())])

pipe_list = [pipe_log, pipe_knn, pipe_tree, pipe_nb]

# Dictionary mapping each pipeline index to a readable name
pipe_dict = {0: 'Logistic Regression', 1: 'KNN', 2: 'Decision Tree', 3: 'Naive Bayes'}

# Fit every pipeline (scaler followed by classifier) on the training set
for pipe in pipe_list:
    pipe.fit(X_train, y_train)

# Compare test accuracy
for idx, val in enumerate(pipe_list):
    print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx], val.score(X_test, y_test)))

# Keep the pipeline with the highest test accuracy
best_acc = 0.0
best_clf = 0
best_pipe = None
for idx, val in enumerate(pipe_list):
    acc = val.score(X_test, y_test)
    if acc > best_acc:
        best_acc = acc
        best_pipe = val
        best_clf = idx
print('Classifier with best accuracy: %s' % pipe_dict[best_clf])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Logistic Regression pipeline test accuracy: 0.849
KNN pipeline test accuracy: 0.847
Decision Tree pipeline test accuracy: 0.879
Naive Bayes pipeline test accuracy: 0.147
Classifier with best accuracy: Decision Tree
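
The convergence warning above suggests raising `max_iter`, and the exercise asks for a search over regularization parameters. The sketch below combines both for logistic regression on synthetic data from `make_classification` (not the Veltec dataset). The `C` grid, `max_iter=1000`, and the names `pipe_log_reg` and `gs_log` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

pipe_log_reg = Pipeline([
    ("scl", StandardScaler()),
    # A higher max_iter gives the solver room to converge.
    ("clf", LogisticRegression(max_iter=1000)),
])

# C is the inverse regularization strength: smaller C means a stronger penalty.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

gs_log = GridSearchCV(pipe_log_reg, param_grid, scoring="accuracy", cv=5)
gs_log.fit(X_toy, y_toy)

print(gs_log.best_params_)
print("Best CV accuracy: %.3f" % gs_log.best_score_)
```

Because the scaler sits inside the pipeline, it is refit on each training fold during the search.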


### Feature search
---

In [61]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
print(X.shape)
X_new = SelectKBest(chi2, k=15).fit_transform(X, y)  # Select the 15 best features with SelectKBest (chi-squared test)
print(X_new.shape)

(12642, 25)
(12642, 15)
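
In the cell above, the selector is fit on all of `X`, including rows that later land in the test set. One alternative is to place the selector inside a pipeline and treat `k` as a hyperparameter. The sketch below uses synthetic count data (not the Veltec dataset). The candidate values for `k` and the names `pipe_sel` and `gs_sel` are assumptions. `chi2` requires non-negative inputs, so no `StandardScaler` precedes it.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 3, size=300)

# Non-negative count features: the first two depend on the class, the rest are noise.
X_toy = rng.poisson(1.0, size=(300, 8))
X_toy[:, 0] += 3 * y_toy
X_toy[:, 1] += 2 * (y_toy == 2)

pipe_sel = Pipeline([
    ("sel", SelectKBest(chi2)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# The number of kept features is searched like any other hyperparameter.
gs_sel = GridSearchCV(pipe_sel, {"sel__k": [2, 4, 8]}, scoring="accuracy", cv=5)
gs_sel.fit(X_toy, y_toy)

selected = gs_sel.best_estimator_.named_steps["sel"].get_support()
print(gs_sel.best_params_)
print(selected)
```

`get_support()` returns a boolean mask over the input columns, which can be used to recover the names of the selected features.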


### Hyperparameter optimization

### Grid search
---

In [67]:
# Use the new X produced by feature selection
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2)

In [64]:
# Tune the pipeline with the best score so far (the decision tree)
param_range = [1, 2, 3, 4, 5]

# Grid search parameters
grid_params = [{'clf__criterion': ['gini', 'entropy'],
    'clf__min_samples_leaf': param_range,
    'clf__max_depth': param_range,
    'clf__min_samples_split': param_range[1:]}]  # min_samples_split must be >= 2

# Construct grid search
gs = GridSearchCV(estimator=pipe_tree,
    param_grid=grid_params,
    scoring='accuracy', n_jobs=-1)

# Fit using grid search
gs.fit(X_train, y_train)

# Best accuracy
print('Best accuracy: %.3f' % gs.best_score_)

# Best params
print('\nBest params:\n', gs.best_params_)

Best accuracy: 0.824

Best params:
 {'clf__criterion': 'entropy', 'clf__max_depth': 5, 'clf__min_samples_leaf': 3, 'clf__min_samples_split': 4}


### Cross-validation
---

In [65]:
kfold = KFold(n_splits=10, random_state=100, shuffle=True)
# Note: these hyperparameters differ from the grid-search best params above
# (criterion='entropy', min_samples_split=4)
model_kfold = DecisionTreeClassifier(criterion='gini', max_depth=5, min_samples_leaf=3, min_samples_split=2)
results_kfold = cross_val_score(model_kfold, X_train, y_train, cv=kfold)

print("scores: ", results_kfold)

print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0))

scores:  [0.81422925 0.8013834  0.79743083 0.79525223 0.80019782 0.8090999
 0.82789318 0.82888229 0.78832839 0.80613254]
Accuracy: 80.69%
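
For comparison, the grid-search best parameters can be cross-validated directly, and `StratifiedKFold` keeps the rank proportions equal across folds. The sketch below runs on synthetic imbalanced data from `make_classification` (not the Veltec dataset). The parameter values are copied from the grid-search output earlier in this notebook, and the `toy`, `best_params`, `skf`, and `scores` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced three-class problem.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0
)

# Values reported by the grid search (the 'clf__' prefix is dropped
# because the tree is used here without a pipeline).
best_params = {
    "criterion": "entropy",
    "max_depth": 5,
    "min_samples_leaf": 3,
    "min_samples_split": 4,
}

# Each fold preserves the class proportions of y_toy.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
model = DecisionTreeClassifier(random_state=0, **best_params)
scores = cross_val_score(model, X_toy, y_toy, cv=skf)

print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the standard deviation next to the mean shows how much the estimate varies from fold to fold.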


In [66]:
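# Final tree evaluated on the held-out test set
# (min_samples_leaf=5 here, versus 3 in the grid-search best params)
tree = DecisionTreeClassifier(criterion='gini', max_depth=5, min_samples_leaf=5, min_samples_split=2)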
tree.fit(X_train,y_train)
y_pred = tree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.41      0.46      0.43       197
           1       0.87      0.94      0.91      1580
           2       0.85      0.58      0.69       104
           3       0.72      0.19      0.30       213
           4       0.78      0.85      0.82       435

    accuracy                           0.81      2529
   macro avg       0.73      0.60      0.63      2529
weighted avg       0.81      0.81      0.79      2529
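
The report above uses the integer codes produced by `LabelEncoder`. The sketch below recovers the code-to-label mapping and builds a labeled confusion matrix. The toy predictions are made up and are not results from this notebook; `le_toy`, `mapping`, and `cm` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the class names, so the codes follow alphabetical order.
le_toy = LabelEncoder().fit(["otimo", "bom", "regular", "ruim", "pessimo"])
mapping = {label: int(code) for code, label in enumerate(le_toy.classes_)}
print(mapping)

# Made-up true and predicted ranks, encoded the same way as in the notebook.
y_true_toy = le_toy.transform(["otimo", "otimo", "bom", "regular", "ruim", "pessimo"])
y_pred_toy = le_toy.transform(["otimo", "bom", "bom", "ruim", "ruim", "pessimo"])

# Rows are true ranks, columns are predicted ranks.
cm = pd.DataFrame(
    confusion_matrix(y_true_toy, y_pred_toy, labels=list(range(len(le_toy.classes_)))),
    index=le_toy.classes_,
    columns=le_toy.classes_,
)
print(cm)
```

With this mapping, class 3 in the report corresponds to `regular`, the rank with the lowest recall (0.19).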

