# Veltec - Driver ranking
## Explanation of the CSV files
---

These CSV files relate to the Veltec request: driver profile.

The goal of this work was to classify drivers into defensive, offensive, and economical driving profiles. Given the events recorded for a driver during a trip, such as speeding, harsh acceleration, harsh braking, and others, our algorithm should assign the driver to one of these profiles. **In this first sprint, the score obtained refers only to the safety criterion**.

With that defined, the first step toward classifying drivers was to devise a score calculation method. The calculation involves several equations that yield a score for each driver on different criteria. An overall score is then computed from these scores. Based on the overall score, drivers were divided into score bands (ranks).

This notebook presents two CSVs:

* vigencias_scores.csv: a *vigência* is a route driven by a driver, whether a long trip or a shorter one. This CSV lists every vigência with the events that occurred during it, along with the score and rank computed for each vigência.
* drivers_medias.csv: this CSV gives each driver's mean score and the resulting mean rank. It also holds the totals summed over all of that driver's vigências in the period.

The files are quite similar, but the distribution of scores changes once the per-driver mean is taken.
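The relationship between the two files can be sketched with a `groupby`. This is a minimal illustration on made-up values, assuming that distance and event counts are summed per driver while `score_geral` is averaged; the actual aggregation used to build drivers_medias.csv is not shown in this notebook.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the real CSVs.
trips = pd.DataFrame({
    "id_motorista": [1, 1, 1, 2, 2],
    "distancia_percorrida_km": [100.0, 200.0, 50.0, 300.0, 120.0],
    "qtd_frenagens": [0, 0, 4, 1, 0],
    "score_geral": [100.0, 100.0, 50.0, 95.0, 85.0],
})

# Assumed aggregation: sum the distance and events, average the score.
per_driver = trips.groupby("id_motorista").agg(
    distancia_percorrida_km=("distancia_percorrida_km", "sum"),
    qtd_frenagens=("qtd_frenagens", "sum"),
    score_geral=("score_geral", "mean"),
).reset_index()

print(per_driver)
```

In this toy case driver 1 has two trips scored 100 and one scored 50, which average to about 83.3. Trips that individually sit at the extremes collapse into a middle value, which is one way the per-driver distribution can differ from the per-vigência one.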

The PDF file "Descrição das tabelas - Veltec" explains each attribute in the table. This notebook's dataset was taken from the table "vigencias_consolidadas.csv", so the attribute descriptions are in the "Consolidado" section of the data dictionary.

---

In [1]:
import pandas as pd

In [2]:
vigencias = pd.read_csv('vigencias_scores.csv')
medias = pd.read_csv('drivers_medias.csv')

In [3]:
print('Shape of the vigências dataframe:')
print(vigencias.shape)
print('Shape of the means dataframe:')
print(medias.shape)

Shape of the vigências dataframe:
(12642, 27)
Shape of the means dataframe:
(676, 24)


In [4]:
vigencias.head()

| | id_vei | id_uo_vei | id_motorista | id_uo_motorista | distancia_percorrida_km | qtd_banguela | qtd_curvas | qtd_aceleracoes | qtd_frenagens | qtd_vel_faixa_1 | ... | qtd_vel_via_faixa_3 | tempo_vel_via_faixa_1 | tempo_vel_via_faixa_2 | tempo_vel_via_faixa_3 | qtd_manuseio_celular | qtd_fadiga_motorista | qtd_distracao_motorista | qtd_uso_cigarro | score_geral | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66486 | 2855 | 636779.0 | 2950.0 | 234.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 208.0 | 0.0 | 1456.0 | 0.0 | 0.0 | 0.0 | 0.0 | 50.0 | ruim |
| 1 | 8155 | 3321 | 636786.0 | 2950.0 | 114.083 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 78.962685 | regular |
| 2 | 8577 | 3327 | 636592.0 | 2950.0 | 183.725 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | otimo |
| 3 | 7665 | 3581 | 636644.0 | 2950.0 | 702.525 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 97.437814 | otimo |
| 4 | 8427 | 2868 | 636989.0 | 2950.0 | 175.29 | 0.0 | 0.0 | 1.0 | 3.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 86.308403 | bom |


In [5]:
medias.head()

| | id_motorista | distancia_percorrida_km | qtd_banguela | qtd_curvas | qtd_aceleracoes | qtd_frenagens | qtd_vel_faixa_1 | qtd_vel_faixa_2 | qtd_vel_faixa_3 | tempo_vel_faixa_1 | ... | qtd_vel_via_faixa_3 | tempo_vel_via_faixa_1 | tempo_vel_via_faixa_2 | tempo_vel_via_faixa_3 | qtd_manuseio_celular | qtd_fadiga_motorista | qtd_distracao_motorista | qtd_uso_cigarro | score_geral | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 380254.0 | 715.996 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 0.0 | 1030.0 | 3848.0 | 0.0 | 0.0 | 0.0 | 0.0 | 74.987698 | regular |
| 1 | 394805.0 | 1455.529 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | otimo |
| 2 | 394806.0 | 2081.978 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99.977357 | otimo |
| 3 | 394807.0 | 3210.916 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99.974232 | otimo |
| 4 | 394808.0 | 2617.744 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | otimo |


The columns from "id_vei" through "id_uo_motorista" are descriptive: they identify the driver and the operational unit (uo).

The columns from "distancia_percorrida_km" through "qtd_uso_cigarro" are events recorded during a driver's vigência (a trip, for example).

The "score_geral" and "rank" columns were created during the first Veltec sprint. The overall score was computed through a series of equations involving the distance traveled and the driver's events. The rank was assigned on the following scale:

* 90 <= overall score <= 100: Excellent (`otimo`)
* 80 <= overall score < 90: Good (`bom`)
* 60 <= overall score < 80: Fair (`regular`)
* 40 <= overall score < 60: Poor (`ruim`)
* overall score < 40: Very poor (`pessimo`)
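The scale above maps directly onto left-closed bins. The sketch below is one way to express it with `pd.cut`; the helper name `score_to_rank` and the constants are illustrative and are not taken from the sprint's code. The sample scores are the five shown in the `vigencias.head()` output.

```python
import numpy as np
import pandas as pd

# Left-closed bins: [-inf, 40), [40, 60), [60, 80), [80, 90), [90, +inf).
# The open upper edge keeps a score of exactly 100 in the top band.
RANK_BINS = [-np.inf, 40, 60, 80, 90, np.inf]
RANK_LABELS = ["pessimo", "ruim", "regular", "bom", "otimo"]


def score_to_rank(scores):
    """Map overall scores to rank labels following the scale above (hypothetical helper)."""
    binned = pd.cut(
        pd.Series(scores, dtype=float),
        bins=RANK_BINS,
        labels=RANK_LABELS,
        right=False,
    )
    return binned.astype(str)


sample_scores = [50.0, 78.962685, 100.0, 97.437814, 86.308403]
print(score_to_rank(sample_scores).tolist())
```

Using `right=False` makes each lower bound inclusive, which matches the `<=` on the left side of every band.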

In [6]:
vigencias['rank'].value_counts()

otimo      7729
ruim       2369
bom         992
regular     960
pessimo     592
Name: rank, dtype: int64

In [7]:
medias['rank'].value_counts()

regular    238
otimo      224
bom        143
ruim        65
pessimo      6
Name: rank, dtype: int64
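Because the two dataframes differ so much in size, the raw counts are easier to compare as percentages. The sketch below re-enters the counts printed above by hand; on the real dataframes the same figures would come from `value_counts(normalize=True)`.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
vigencias_counts = pd.Series(
    {"otimo": 7729, "ruim": 2369, "bom": 992, "regular": 960, "pessimo": 592}
)
medias_counts = pd.Series(
    {"regular": 238, "otimo": 224, "bom": 143, "ruim": 65, "pessimo": 6}
)

# Share of each rank in percent; pandas aligns the two Series on the rank label.
shares = pd.DataFrame({
    "vigencias_pct": vigencias_counts / vigencias_counts.sum() * 100.0,
    "medias_pct": medias_counts / medias_counts.sum() * 100.0,
}).round(1)

print(shares)
```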

# Daniel's part

In [155]:
import numpy as np
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LinearRegression, SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [145]:
# Rank classes present (imbalanced; to be balanced)
print(vigencias["rank"].unique())

['ruim' 'regular' 'otimo' 'bom' 'pessimo']


In [146]:
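# Build the feature matrix and the target.
# Note: X still contains "score_geral", from which "rank" is derived directly,
# as well as the identifier columns (id_vei, id_uo_vei, id_motorista, id_uo_motorista).
X = vigencias.drop("rank", axis=1)
y = vigencias["rank"]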

In [147]:
# Train/test split, stratified by rank (the default test_size holds out 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [148]:
y_train.value_counts()

otimo      5796
ruim       1777
bom         744
regular     720
pessimo     444
Name: rank, dtype: int64

In [149]:
# With cross-validation
# random_state is dropped: it has no effect without shuffle=True
# (and recent scikit-learn versions reject that combination).
kfold = KFold(n_splits=10)
model = SGDClassifier()
result_kfold = cross_val_score(model, X_train, y_train, cv=kfold)

# Hyperparameter tuning
grid_params = [{'penalty': ['l2', 'l1', 'elasticnet']}]

gs = GridSearchCV(estimator=model, param_grid=grid_params, cv=10, scoring='accuracy')
gs.fit(X_train, y_train)

print(gs.best_score_)

print("scores: ", result_kfold)
print("Accuracy: %.2f%%" % (result_kfold.mean() * 100.0))

print(confusion_matrix(y_test, gs.predict(X_test)))

0.6654361354287522
scores:  [0.67650158 0.57172996 0.43248945 0.62974684 0.63080169 0.6592827
 0.56223629 0.42721519 0.63396624 0.69198312]
Accuracy: 59.16%
[[ 126  109    1   12    0]
 [ 597 1300    0   36    0]
 [  81    7    7    4   49]
 [ 155   78    0    7    0]
 [ 241  105   10   10  226]]
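The SGD results above come from unscaled features, and linear models trained with stochastic gradient descent are known to be sensitive to feature scale. The notebook imports `Pipeline` and `StandardScaler` without using them; one possible use, sketched here on synthetic data (the feature values and labels are made up), is to standardize the features inside the cross-validation loop before fitting `SGDClassifier`. Whether this improves the scores on the Veltec data is not something this sketch establishes.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two hypothetical features on very different scales.
rng = np.random.default_rng(0)
n_samples = 600
X_demo = np.column_stack([
    rng.uniform(0, 1000, n_samples),  # e.g. a distance-like column
    rng.uniform(0, 5, n_samples),     # e.g. an event-count-like column
])
# Made-up labels that depend only on the small-scale feature.
y_demo = np.where(X_demo[:, 1] > 2.5, "bom", "ruim")

# The scaler is fitted on each training fold only, so no test-fold
# statistics leak into the preprocessing step.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scaled_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print("Accuracy: %.2f%%" % (scaled_scores.mean() * 100.0))
```

The same `pipe` object could be passed to `GridSearchCV` as the estimator, with parameter names prefixed by the step name (for example `sgd__penalty`).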


In [73]:
# Without cross-validation
model = SGDClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
# accuracy_score returns a fraction, so scale it to a percentage
print("Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100.0))

[[   0  242    0    2    0]
 [   0 1951    0    3    0]
 [   0  101    0    4   50]
 [   0  224    0    3    0]
 [   0  412    0    7  162]]
Accuracy: 66.94%


In [74]:
####################################################################

In [153]:
# With cross-validation
# random_state is dropped: it has no effect without shuffle=True
# (and recent scikit-learn versions reject that combination).
kfold = KFold(n_splits=10)
model = DecisionTreeClassifier()
result_kfold = cross_val_score(model, X_train, y_train, cv=kfold)

# Hyperparameter tuning
grid_params = [{'criterion': ['gini', 'entropy'],
                'splitter': ['best', 'random']}]

gs = GridSearchCV(estimator=model, param_grid=grid_params, cv=10, scoring='accuracy')
gs.fit(X_train, y_train)

print(gs.best_score_)

print("scores: ", result_kfold)
print("Accuracy: %.2f%%" % (result_kfold.mean() * 100.0))

print(confusion_matrix(y_test, gs.predict(X_test)))

0.999894525893893
scores:  [1.         1.         1.         1.         0.99894515 1.
 1.         1.         1.         1.        ]
Accuracy: 99.99%
[[ 248    0    0    0    0]
 [   1 1932    0    0    0]
 [   0    0  146    0    2]
 [   0    0    0  240    0]
 [   0    0    0    0  592]]


In [80]:
# Without cross-validation
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
# accuracy_score returns a fraction, so scale it to a percentage
print("Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100.0))

[[ 244    0    0    0    0]
 [   0 1954    0    0    0]
 [   0    0  153    0    2]
 [   1    0    0  226    0]
 [   0    0    0    0  581]]
Accuracy: 99.91%
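The near-perfect decision tree accuracy is consistent with `score_geral` being among the features: `rank` is defined by thresholds on that score, and a tree can learn those thresholds directly. The sketch below illustrates the effect on synthetic data (uniform random scores plus an unrelated, made-up event column); it is not a rerun of the notebook's experiment, and the helper `holdout_accuracy` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: random scores and an event count unrelated to the rank.
rng = np.random.default_rng(0)
n_rows = 2000
demo = pd.DataFrame({
    "score_geral": rng.uniform(0, 100, n_rows),
    "qtd_frenagens": rng.integers(0, 5, n_rows),
})
# Rank derived from the score with the same bands as the scale in this notebook.
demo["rank"] = pd.cut(
    demo["score_geral"],
    bins=[-np.inf, 40, 60, 80, 90, np.inf],
    labels=["pessimo", "ruim", "regular", "bom", "otimo"],
    right=False,
).astype(str)


def holdout_accuracy(feature_columns):
    """Fit a tree on the given columns and return its hold-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[feature_columns], demo["rank"], stratify=demo["rank"], random_state=0
    )
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, tree.predict(X_te))


acc_with_score = holdout_accuracy(["score_geral", "qtd_frenagens"])
acc_without_score = holdout_accuracy(["qtd_frenagens"])
print("with score_geral:    %.2f%%" % (acc_with_score * 100.0))
print("without score_geral: %.2f%%" % (acc_without_score * 100.0))
```

If the aim is to predict the rank from driving events alone, dropping `score_geral` (and probably the identifier columns) from `X` before the split would give a more informative comparison between models.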


In [None]:
####################################################################

In [156]:
# With cross-validation
# random_state is dropped: it has no effect without shuffle=True
# (and recent scikit-learn versions reject that combination).
kfold = KFold(n_splits=10)
model = KNeighborsClassifier()
result_kfold = cross_val_score(model, X_train, y_train, cv=kfold)

# Hyperparameter tuning
grid_params = [{'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}]

gs = GridSearchCV(estimator=model, param_grid=grid_params, cv=10, scoring='accuracy')
gs.fit(X_train, y_train)

print(gs.best_score_)

print("scores: ", result_kfold)
print("Accuracy: %.2f%%" % (result_kfold.mean() * 100.0))

print(confusion_matrix(y_test, gs.predict(X_test)))

0.7798755405547938
scores:  [0.7776607  0.80907173 0.7721519  0.7478903  0.77320675 0.77637131
 0.79008439 0.77531646 0.77531646 0.79746835]
Accuracy: 77.95%
[[  85  152    0    8    3]
 [  47 1871    0   10    5]
 [  10   17   46   13   62]
 [  26  125    1   76   12]
 [  21   76   44   34  417]]


In [157]:
# Without cross-validation
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
# accuracy_score returns a fraction, so scale it to a percentage
print("Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100.0))

[[  85  152    0    8    3]
 [  47 1871    0   10    5]
 [  10   17   46   13   62]
 [  26  125    1   76   12]
 [  21   76   44   34  417]]
Accuracy: 78.93%
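Overall accuracy hides how unevenly the classes are handled. As a sketch, per-class recall can be read off the KNN confusion matrix printed above. The label order is assumed to be alphabetical, which is how scikit-learn orders labels in `confusion_matrix` when none are passed; the row totals match the expected 25% share of each rank.

```python
import numpy as np

# KNN confusion matrix copied from the output above
# (rows: true rank, columns: predicted rank).
labels = ["bom", "otimo", "pessimo", "regular", "ruim"]  # assumed alphabetical order
cm = np.array([
    [85, 152, 0, 8, 3],
    [47, 1871, 0, 10, 5],
    [10, 17, 46, 13, 62],
    [26, 125, 1, 76, 12],
    [21, 76, 44, 34, 417],
])

# Accuracy is the diagonal over the total; recall is the diagonal over each row sum.
accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)

print("Accuracy: %.2f%%" % (accuracy * 100.0))
for label, value in zip(labels, recall):
    print("recall %-8s %.1f%%" % (label, value * 100.0))
```

Most of the accuracy comes from the majority class `otimo`, while `bom`, `pessimo`, and `regular` are each recovered less than half of the time.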
