# Pre-processing and Tests on the World Cup Dataset

The dataset used here was taken from the [FIFA World Men's Ranking](https://www.fifa.com/fifa-world-ranking/ranking-table/men/index.html). The goal is to use each national team's ranking values over the 5 months preceding the World Cup (January to May, treated as a time series), together with its final classification in the 2002, 2006, 2010 and 2014 tournaments, to predict the classification at the 2018 World Cup.



**Students:**

- Marcos Wenneton V. de Araújo
- Luiz Eduardo F. Bentes

In [258]:
# Imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score

In [261]:
# Load the training dataset
df_total = pd.read_csv("dataset_rank_fifa.csv")
# Drop the identifier and target columns to obtain the predictors
df_pred = df_total.drop(['country','year','classification'], axis=1)
df_total.head()

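| | country | year | jan | feb | mar | apr | may | classification |
|---|---|---|---|---|---|---|---|---|
| 0 | Denmark | 2002 | 672 | 670 | 672 | 660 | 657 | 0 |
| 1 | Senegal | 2002 | 536 | 595 | 595 | 596 | 599 | 0 |
| 2 | Uruguay | 2002 | 664 | 662 | 661 | 660 | 652 | 0 |
| 3 | France | 2002 | 812 | 809 | 807 | 810 | 802 | 0 |
| 4 | Spain | 2002 | 730 | 727 | 728 | 715 | 713 | 0 |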


## Pre-processing
### Splitting the dataset by year

We split the dataset by year and apply `StandardScaler` to each year separately, because the average ranking value differs from one year to another. The scaled per-year frames are then joined back together into `new_df`.

In [40]:
df_2002 = df_pred.loc[df_total['year'] == 2002]
df_2006 = df_pred.loc[df_total['year'] == 2006]
df_2010 = df_pred.loc[df_total['year'] == 2010]
df_2014 = df_pred.loc[df_total['year'] == 2014]

In [41]:
df_2002 = pd.DataFrame(StandardScaler().fit_transform(df_2002), index=df_2002.index, columns=df_2002.columns)
df_2006 = pd.DataFrame(StandardScaler().fit_transform(df_2006), index=df_2006.index, columns=df_2006.columns)
df_2010 = pd.DataFrame(StandardScaler().fit_transform(df_2010), index=df_2010.index, columns=df_2010.columns)
df_2014 = pd.DataFrame(StandardScaler().fit_transform(df_2014), index=df_2014.index, columns=df_2014.columns)
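
The per-year scaling above could also be written with a single `groupby`. The following is only a sketch under stated assumptions: the toy values and the helper name `zscore` are hypothetical, and `ddof=0` is chosen so the result matches the population standard deviation that `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with the same kind of layout as the ranking dataset.
toy = pd.DataFrame({
    "year": [2002, 2002, 2002, 2006, 2006, 2006],
    "jan":  [672.0, 536.0, 812.0, 1200.0, 900.0, 1500.0],
    "may":  [657.0, 599.0, 802.0, 1100.0, 950.0, 1450.0],
})
month_cols = ["jan", "may"]

def zscore(col):
    # ddof=0 divides by the population standard deviation, as StandardScaler does.
    return (col - col.mean()) / col.std(ddof=0)

# transform applies zscore to each column within each year and keeps the original index.
scaled = toy.groupby("year")[month_cols].transform(zscore)
print(scaled.round(3))
```

Because `transform` preserves the row index, the scaled frame can be joined back to the labels without any reordering.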

In [43]:
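# Join the scaled per-year frames; the original row index is preserved
new_df = pd.concat([df_2002, df_2006, df_2010, df_2014])
# Labels are aligned with the features by index
new_df['classification'] = df_total['classification']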

In [44]:
new_df.classification.value_counts()

0    112
4      4
3      4
2      4
1      4
Name: classification, dtype: int64

In [45]:
new_df0 = new_df.loc[new_df['classification'] == 0]
new_df0 = new_df0.sample(92)
new_df = new_df.drop(new_df0.index)

In [46]:
new_df = new_df.reset_index(drop=True)
new_df.classification.value_counts()
# Some class-0 examples were removed so that the dataset is better balanced

0    20
4     4
3     4
2     4
1     4
Name: classification, dtype: int64
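
Since `sample` draws rows at random, the rows that are removed change on every run. A minimal sketch of a reproducible variant follows, assuming a toy frame with the same class counts as above; the `random_state` value is an arbitrary assumption, not something taken from the original notebook.

```python
import pandas as pd

# Toy frame mirroring the class counts reported above: 112 zeros, 4 each of classes 1-4.
labels = [0] * 112 + [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4
toy = pd.DataFrame({"classification": labels})

# Pick 92 class-0 rows to discard; random_state (assumed value) pins the selection.
to_drop = toy[toy["classification"] == 0].sample(n=92, random_state=42).index
balanced = toy.drop(to_drop).reset_index(drop=True)

print(balanced["classification"].value_counts().sort_index())
```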

### Splitting the data into training and test sets

`StratifiedShuffleSplit` is used to split the dataset, so that the training and test sets keep the class proportions.

In [47]:
feature_cols = [x for x in new_df.columns if x != 'classification']

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25)

train_idx, test_idx = next(sss.split(new_df[feature_cols], new_df['classification']))

In [48]:
X_train = new_df.loc[train_idx, feature_cols]
y_train = new_df.loc[train_idx, 'classification']

X_test = new_df.loc[test_idx, feature_cols]
y_test = new_df.loc[test_idx, 'classification']
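
To see what stratification does with class counts like these, the sketch below splits a toy label vector and prints the per-class counts on each side. The features are random placeholders and `random_state=0` is an assumed value added only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Labels with the balanced class counts from above; features are random placeholders.
y_toy = np.array([0] * 20 + [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4)
X_toy = np.random.default_rng(0).normal(size=(len(y_toy), 5))

# random_state is an assumption, added so the split can be reproduced.
sss_toy = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx_toy, test_idx_toy = next(sss_toy.split(X_toy, y_toy))

# Per-class counts in each part of the split.
print("train:", np.bincount(y_toy[train_idx_toy]))
print("test: ", np.bincount(y_toy[test_idx_toy]))
```

With only 4 examples in each of the classes 1 to 4, the test set gets a single example per podium class, which is worth keeping in mind when reading the scores below.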

## Tests with ML models
### Logistic Regression

### Neural Networks

In [53]:
mlp = MLPClassifier(hidden_layer_sizes=(3,2), activation='identity', solver='lbfgs')
mlp.fit(X_train,y_train)

y_pred = mlp.predict(X_test)

#print(cross_val_score(mlp,X,y,cv=sss))

print("F-score: ", f1_score(y_test, y_pred, average='micro'))
print("Accuracy: ", accuracy_score(y_test, y_pred))

F-score:  0.3333333333333333
Accuracy:  0.3333333333333333


In [54]:
y_pred

array([0, 1, 0, 4, 2, 0, 3, 0, 0])
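
With a test set of only 9 samples, a single split gives a noisy score. The commented-out `cross_val_score` call in the cell above points towards repeated evaluation. A minimal sketch on synthetic data follows, assuming `GaussianNB` as a fast stand-in for the model and made-up features, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
y_toy = np.array([0] * 20 + [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4)
# Synthetic features (hypothetical): class-dependent mean plus unit-variance noise.
X_toy = rng.normal(loc=y_toy[:, None], scale=1.0, size=(len(y_toy), 5))

# Ten stratified random splits instead of one; random_state is an assumed value.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(GaussianNB(), X_toy, y_toy, cv=cv, scoring="f1_micro")

print("mean F-score: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```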

### Bagging

In [57]:
bc = BaggingClassifier(n_estimators=500)

bc.fit(X_train,y_train)

y_pred = bc.predict(X_test)

print("F-score: ", f1_score(y_test, y_pred, average='micro'))
print("Accuracy: ", accuracy_score(y_test, y_pred))

F-score:  0.6666666666666666
Accuracy:  0.6666666666666666


In [58]:
y_pred

array([0, 4, 0, 4, 3, 2, 0, 0, 0])

In [59]:
np.array(y_test)

array([2, 4, 0, 0, 3, 1, 0, 0, 0])
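
The F-score and the accuracy are identical in every test here. That is expected: for single-label multiclass problems, the micro-averaged F-score reduces to accuracy. The sketch below checks this by hand on the Bagging arrays shown above; the variable names are illustrative.

```python
import numpy as np

# Arrays copied from the Bagging outputs above.
y_pred_bag = np.array([0, 4, 0, 4, 3, 2, 0, 0, 0])
y_true_bag = np.array([2, 4, 0, 0, 3, 1, 0, 0, 0])

# Sum true positives, false positives and false negatives over all classes.
classes = np.union1d(y_true_bag, y_pred_bag)
tp = sum(np.sum((y_pred_bag == c) & (y_true_bag == c)) for c in classes)
fp = sum(np.sum((y_pred_bag == c) & (y_true_bag != c)) for c in classes)
fn = sum(np.sum((y_pred_bag != c) & (y_true_bag == c)) for c in classes)

# Every wrong prediction is one false positive and one false negative,
# so fp == fn and the micro F-score collapses to tp / n.
micro_f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = np.mean(y_pred_bag == y_true_bag)
print(micro_f1, accuracy)
```

A macro average (`average='macro'`) would weight the five classes equally and could be more informative for a dataset this imbalanced.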

### Naive Bayes

In [61]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)

print("F-score: ", f1_score(y_test, y_pred, average='micro'))
print("Accuracy: ", accuracy_score(y_test, y_pred))

F-score:  0.4444444444444444
Accuracy:  0.4444444444444444


In [62]:
y_pred

array([2, 2, 0, 2, 3, 2, 3, 0, 3])

In [63]:
np.array(y_test)

array([2, 4, 0, 0, 3, 1, 0, 0, 0])

## Results

To obtain a more reliable prediction, we train 8 different models and compare their predictions before deciding on the final result. The models are:

- Bagging with Decision Trees, using 20, 50, 100 and 200 estimators (4 models);
- Bagging with Artificial Neural Networks, using 20 estimators;
- Naive Bayes;
- A single Neural Network;
- A single Decision Tree.

In [186]:
df2018 = pd.read_csv("dataset_rank_fifa_2018.csv")
df2018.head()

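| | country | jan | feb | mar | apr | may |
|---|---|---|---|---|---|---|
| 0 | Germany | 1602 | 1602 | 1609 | 1533 | 1558 |
| 1 | Brazil | 1483 | 1484 | 1489 | 1384 | 1431 |
| 2 | Portugal | 1358 | 1358 | 1360 | 1306 | 1274 |
| 3 | Argentina | 1348 | 1348 | 1359 | 1254 | 1241 |
| 4 | Belgium | 1325 | 1325 | 1337 | 1346 | 1298 |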


In [187]:
df2018_predict = df2018.drop('country', axis=1)
# Apply StandardScaler to the prediction data
df2018_predict = pd.DataFrame(StandardScaler().fit_transform(df2018_predict),
                              index=df2018_predict.index, columns=df2018_predict.columns)
results = pd.DataFrame()

In [188]:
X = new_df.drop('classification', axis=1)
y = new_df['classification']

In [189]:
# Bagging with Decision Trees (the default base estimator)
list_estimators = [20, 50, 100, 200]

for est in list_estimators:
    bc = BaggingClassifier(n_estimators=est)

    bc.fit(X, y)

    y_pred = bc.predict(df2018_predict)

    results['predict_Bagging_' + str(est)] = y_pred

In [192]:
# Bagging with ANNs (artificial neural networks)
mlp = MLPClassifier(hidden_layer_sizes=(3,2), activation='identity', solver='lbfgs')
bc = BaggingClassifier(mlp, n_estimators=20)

bc.fit(X,y)

y_pred = bc.predict(df2018_predict)

y_pred

array([1, 1, 0, 2, 0, 0, 2, 1, 1, 4, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

In [193]:
results['predict_Bagging_NN'] = y_pred

In [194]:
# Naive Bayes
nb = GaussianNB()

nb.fit(X, y)

y_pred = nb.predict(df2018_predict)

y_pred

array([1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

In [195]:
results['predict_NB'] = y_pred

In [196]:
# Neural Network
mlp = MLPClassifier(hidden_layer_sizes=(3,2), activation='identity', solver='lbfgs')
mlp.fit(X,y)

y_pred = mlp.predict(df2018_predict)

In [197]:
results['predict_NN'] = y_pred

In [198]:
# Decision Tree

dt = DecisionTreeClassifier()

dt.fit(X,y)

y_pred = dt.predict(df2018_predict)

In [199]:
results['predict_DT'] = y_pred

In [200]:
results

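| | predict_Bagging_20 | predict_Bagging_50 | predict_Bagging_100 | predict_Bagging_200 | predict_Bagging_NN | predict_NB | predict_NN | predict_DT |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 |
| 2 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 2 |
| 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 4 | 2 | 2 | 2 | 2 | 0 | 2 | 1 | 2 |
| 5 | 3 | 0 | 3 | 3 | 0 | 2 | 0 | 3 |
| 6 | 4 | 4 | 3 | 4 | 2 | 2 | 2 | 2 |
| 7 | 4 | 4 | 4 | 4 | 1 | 2 | 0 | 2 |
| 8 | 4 | 4 | 4 | 4 | 1 | 2 | 0 | 2 |
| 9 | 1 | 1 | 1 | 1 | 4 | 3 | 0 | 1 |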


In [202]:
countries = list(df2018['country'])

In [213]:
values_list = list(results.values)

In [215]:
# Count, for each country, how many models predicted each class (later turned into a dataframe)
dicts_counts = []
for i in range(len(values_list)):
    unique, counts = np.unique(values_list[i], return_counts=True)
    dicts_counts.append(dict(zip(unique,counts)))

In [234]:
posicoes = pd.DataFrame()
for i in range(len(dicts_counts)):
    df = pd.DataFrame(dicts_counts[i], index=[countries[i]])
    posicoes = pd.concat([posicoes,df])
posicoes = posicoes.fillna(0)
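
The counting loops above can be expressed more compactly in pandas. The sketch below uses the first three rows of the `results` table and assumes the class labels are exactly 0 to 4; the names `votes` and `vote_counts` are illustrative.

```python
import pandas as pd

# First three rows of the results table above, one column per model.
votes = pd.DataFrame(
    [[0, 0, 0, 0, 1, 1, 1, 0],
     [1, 1, 1, 1, 1, 1, 2, 1],
     [2, 2, 2, 2, 0, 2, 0, 2]],
    index=["Germany", "Brazil", "Portugal"],
)

# For each class label 0-4, count how many models predicted it for each country.
vote_counts = pd.DataFrame({k: (votes == k).sum(axis=1) for k in range(5)})
print(vote_counts)
```

Building the columns from a fixed label range means classes that no model predicted still appear as a column of zeros, so no `fillna` step is needed.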

In [257]:
posicoes

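| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Germany | 5.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| Brazil | 0.0 | 7.0 | 1.0 | 0.0 | 0.0 |
| Portugal | 2.0 | 0.0 | 6.0 | 0.0 | 0.0 |
| Argentina | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 |
| Belgium | 1.0 | 1.0 | 6.0 | 0.0 | 0.0 |
| Spain | 3.0 | 0.0 | 1.0 | 4.0 | 0.0 |
| Poland | 0.0 | 0.0 | 4.0 | 1.0 | 3.0 |
| Switzerland | 1.0 | 1.0 | 2.0 | 0.0 | 4.0 |
| France | 1.0 | 1.0 | 2.0 | 0.0 | 4.0 |
| Peru | 1.0 | 5.0 | 0.0 | 1.0 | 1.0 |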


## Final Result

Based on the trained models, we counted how many models predicted 1st, 2nd, 3rd and 4th place for each country. This led to the following results:

**1st place**: Brazil (predicted as 1st place 7 times)

**2nd place**: Germany (predicted as 2nd place 8 times)

**3rd place**: Colombia (predicted as 3rd place 5 times)

**4th place**: France (predicted as 4th place 4 times)
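
The notebook does not state how ties or repeated winners are resolved when reading the vote table. One possible rule, shown here purely as a sketch, is a greedy assignment: go through the places in order and give each one to the most-voted team that has no place yet. The teams and vote counts below are hypothetical.

```python
import pandas as pd

# Hypothetical vote table: rows are teams, columns are places (1 = champion ... 4 = fourth).
votes = pd.DataFrame(
    {1: [7, 3, 0, 0, 1], 2: [1, 5, 6, 0, 0], 3: [0, 0, 2, 5, 1], 4: [0, 0, 0, 3, 4]},
    index=["Team A", "Team B", "Team C", "Team D", "Team E"],
)

podium = {}
for place in [1, 2, 3, 4]:
    # Exclude teams that already hold a place, then take the most-voted remaining team.
    remaining = votes.drop(index=list(podium.values()))
    podium[place] = remaining[place].idxmax()

print(podium)
```

`idxmax` returns the first label when several teams tie, so the row order acts as the tie-breaker in this sketch.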