# Values of Soccer Players in FIFA 2019

## 5-fold cross-validation
We want to predict Value (EUR) based on the features:
- Age 
- Overall
- Potential
- International Reputation
- Weight (kg)
- Height (cm)
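
For reference, the score printed as `RMSE` by the cross-validation code in this section is the average of the per-fold root mean squared errors. As a sketch, writing $T_k$ for the set of test indices in fold $k$ and $\hat{y}_i$ for the predicted value of player $i$ (both symbols are introduced here for illustration only):

$$
\mathrm{MSE}_k = \frac{1}{|T_k|} \sum_{i \in T_k} \left(\hat{y}_i - y_i\right)^2,
\qquad
\mathrm{RMSE} = \frac{1}{5} \sum_{k=1}^{5} \sqrt{\mathrm{MSE}_k}
$$

Note that this is the mean of the five per-fold RMSE values, which is generally not the same as the square root of the mean MSE.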

In [1]:
import pandas as pd 

report = pd.read_csv('/Users/WoodPecker/PycharmProjects/JupyterProject/fifa_clean.csv', header=0)

In [2]:
import math
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold

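def process(x, y, n_splits=5):
    """Run k-fold cross-validation; print the per-fold MSE and the mean RMSE."""
    reg = linear_model.LinearRegression()
    # shuffle=True without a random_state: the folds, and so the scores, vary between runs
    kf = KFold(n_splits=n_splits, shuffle=True)
    rmse_sum = 0
    for counter, (train_index, test_index) in enumerate(kf.split(x), start=1):
        reg.fit(x[train_index], y[train_index])
        # Mean squared error on the held-out fold
        MSE = np.mean((reg.predict(x[test_index]) - y[test_index]) ** 2)
        print("Round ", counter, ": MSE = ", MSE)
        rmse_sum += math.sqrt(MSE)
    # Average of the per-fold RMSE values
    RMSE = rmse_sum / n_splits
    print("RMSE: ", RMSE)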

    
x = report.iloc[:,[4,6,7,10,15,14]].values
y = report.iloc[:,49].values 

process(x, y)

Round  1 : MSE =  253.10567975276996
Round  2 : MSE =  261.1915755218137
Round  3 : MSE =  268.08478742577165
Round  4 : MSE =  258.4089315341406
Round  5 : MSE =  308.8658389932922
RMSE:  16.418738994001394
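
The manual loop above can also be written with scikit-learn's `cross_val_score`. The following is a minimal sketch on synthetic stand-in data, not the notebook's actual run: `x_demo`, `y_demo`, the coefficients, and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; the real features come from fifa_clean.csv)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 6))
true_coef = np.array([1.0, 2.0, 0.5, 3.0, 0.0, -1.0])
y_demo = x_demo @ true_coef + rng.normal(scale=0.5, size=200)

# random_state fixes the shuffling so that repeated runs use the same folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn scorers follow "greater is better", so the MSE is returned negated
neg_mse = cross_val_score(LinearRegression(), x_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=kf)
fold_mse = -neg_mse

# Same aggregation as process(): the mean of the per-fold RMSE values
mean_rmse = np.sqrt(fold_mse).mean()
print("Per-fold MSE:", fold_mse)
print("Mean RMSE:", mean_rmse)
```

Fixing `random_state` makes the scores reproducible, which helps when comparing two feature sets against each other.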


## Additional Features
We improve our model by adding the following features:
- Nationality (one-hot-encoded)
- SprintSpeed
- ShotPower
- Stamina
- Penalties
- A derived feature, computed as the sum of:
    - Agility
    - Reactions
    - Balance

In [3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

x = report.iloc[:,[4,5,6,7,10,14,15,26,30,32,39]].values

# Derived feature: row-wise sum of columns 27-29, appended as one extra column
derived = report.iloc[:, [27,28,29]].sum(axis=1).values
x = np.column_stack((x, derived))

# Map the categorical labels in column 1 of x to integers
labelencoder = LabelEncoder()
x[:, 1] = labelencoder.fit_transform(x[:, 1])

# Note: the encoder is fitted on the whole matrix, so every column
# (not only column 1) is treated as categorical and one-hot-encoded
onehotencoder = OneHotEncoder()
x = onehotencoder.fit_transform(x)

process(x, y)


Round  1 : MSE =  135.26125393638918
Round  2 : MSE =  169.50260327734313
Round  3 : MSE =  160.04144559212196
Round  4 : MSE =  154.319997338674
Round  5 : MSE =  159.62473638387848
RMSE:  12.471416179023977
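
The cell above fits `OneHotEncoder` on the whole feature matrix, so the numeric columns are treated as categories as well. If the intent is to encode only Nationality and keep the numeric features as numbers, one possible approach is `ColumnTransformer`. The sketch below uses a small made-up frame; the rows and the name `demo` are assumptions, and applying this to the real data would change the scores reported above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small made-up frame (an assumption; the values are invented for illustration)
demo = pd.DataFrame({
    "Age": [21, 30, 25, 28],
    "Overall": [70, 85, 78, 80],
    "Nationality": ["Spain", "Brazil", "Spain", "Germany"],
})

# One-hot encode only Nationality; pass the numeric columns through unchanged
ct = ColumnTransformer(
    [("nat", OneHotEncoder(handle_unknown="ignore"), ["Nationality"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
x_demo = ct.fit_transform(demo)

# Three nationality indicator columns followed by Age and Overall
print(x_demo.shape)
print(x_demo[0])
```

With `handle_unknown="ignore"`, a nationality that was not seen during fitting is encoded as all zeros instead of raising an error.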


## Achieving an RMSE close to 0

An RMSE close to 0 can be achieved by <b>overfitting</b>, that is, by adding a very large number of features and/or features that are unique to each player (e.g. player names or IDs).
As a consequence, the model learns the data by heart, and the RMSE measured on the data it was trained on approaches 0.
This is not a practical solution, because values for newly added players will not be predicted correctly.
Adding such an extensive set of features also takes considerable effort.
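
The effect can be illustrated with a minimal sketch on synthetic data (everything here is an assumption: the data, the names, and the single train/test split are for illustration only). Each row gets a one-hot-encoded unique ID, which lets the linear model reproduce the training values almost exactly while gaining nothing on unseen rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): one informative feature plus noise
rng = np.random.default_rng(0)
n = 100
feature = rng.normal(size=(n, 1))
y_demo = 3.0 * feature[:, 0] + rng.normal(size=n)

# One indicator column per row, i.e. a one-hot-encoded unique ID
ids_onehot = np.eye(n)
x_demo = np.hstack([feature, ids_onehot])

x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0)
reg = LinearRegression().fit(x_train, y_train)

def rmse(model, x, y):
    """Root mean squared error of the model's predictions on (x, y)."""
    return float(np.sqrt(np.mean((model.predict(x) - y) ** 2)))

train_rmse = rmse(reg, x_train, y_train)  # close to 0: the model memorizes each row
test_rmse = rmse(reg, x_test, y_test)     # the ID columns carry no information here
print("train RMSE:", train_rmse)
print("test RMSE:", test_rmse)
```

Comparing the two numbers shows why a near-zero RMSE is only meaningful when it is measured on data the model has not seen.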