# FIFA21 Project

## Goal: predict a player's "Overall Rating" by analysing data

#### Links to review terminology:
- Explanations of the acronyms and abbreviations can be found here (https://gaming.stackexchange.com/questions/167318/what-do-fifa-14-position-acronyms-mean) and here (https://fifauteam.com/fifa-ultimate-team-positions-and-tactics/)
- FIFA 19 player rating guide (https://fifauteam.com/player-ratings-guide-fifa-19/)
- FIFA overall rating explained (https://earlygame.com/fifa/fifa-ratings-explained-overall-rating)
- FIFA 21 all attributes divided (https://fifauteam.com/fifa-21-attributes-guide/)

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error


In [2]:
data = pd.read_csv('files_for_project/csv_files/fifa21_train.csv')
data

Unnamed: 0,ID,Name,Age,Nationality,Club,BP,Position,Team & Contract,Height,Weight,...,CDM,RDM,RWB,LB,LCB,CB,RCB,RB,GK,OVA
0,184383,A. Pasche,26,Switzerland,FC Lausanne-Sport,CM,CM CDM,FC Lausanne-Sport 2015 ~ 2020,"5'9""",161lbs,...,59+1,59+1,59+1,58+1,54+1,54+1,54+1,58+1,15+1,64
1,188044,Alan Carvalho,30,China PR,Beijing Sinobo Guoan FC,ST,ST LW LM,"Beijing Sinobo Guoan FC Dec 31, 2020 On Loan","6'0""",159lbs,...,53+2,53+2,57+2,53+2,48+2,48+2,48+2,53+2,18+2,77
2,184431,S. Giovinco,33,Italy,Al Hilal,CAM,CAM CF,Al Hilal 2019 ~ 2022,"5'4""",134lbs,...,56+2,56+2,59+2,53+2,41+2,41+2,41+2,53+2,12+2,80
3,233796,J. Evans,22,Wales,Swansea City,CDM,CDM CM,Swansea City 2016 ~ 2021,"5'10""",152lbs,...,58+2,58+2,56+2,57+2,58+2,58+2,58+2,57+2,14+2,59
4,234799,Y. Demoncy,23,France,US Orléans Loiret Football,CDM,CDM CM,US Orléans Loiret Football 2018 ~ 2021,"5'11""",150lbs,...,64+2,64+2,64+2,63+2,61+2,61+2,61+2,63+2,15+2,65
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11696,232504,B. Böðvarsson,25,Iceland,Jagiellonia Białystok,LB,LB,Jagiellonia Białystok 2018 ~ 2021,"6'1""",168lbs,...,60+2,60+2,63+2,63+2,61+2,61+2,61+2,63+2,16+2,65
11697,214680,G. Gallon,27,France,ESTAC Troyes,GK,GK,ESTAC Troyes 2019 ~ 2022,"6'1""",174lbs,...,26+2,26+2,25+2,24+2,26+2,26+2,26+2,24+2,69+2,70
11698,221489,J. Flores,22,Chile,CD Antofagasta,RM,LM CAM RM,CD Antofagasta 2019 ~ 2024,"5'6""",143lbs,...,44+2,44+2,49+2,45+2,35+2,35+2,35+2,45+2,17+2,67
11699,146717,Anderson Silva,26,Brazil,Barnsley,CM,,Barnsley 2010,"6'2""",179lbs,...,68+0,68+0,66+0,64+0,60+0,60+0,60+0,64+0,25+0,68


# Decisions

- As we were unfamiliar with the formula and could not find an exact one in our research and observation of the dataset, we intuited the potential of the scores of their attributes plus their best position scores. Our main decisions for this approach are:
    1. OVR = ATT + IR (Overal Rating equals attributes and international reputation).
    2. For those columns with "+", we're going to keep the second number. We believe the first number can be calculated from the columns of atributes, although we are unsure as to how.
    3. By initial sum through categories in the given resource page of player profiles, we now know some columns are the sum of other columns. We decided to keep the most of the information for the model to better understand the patterns.

## Data cleaning

#### Rename Columns

In [3]:
data.columns = data.columns.str.lower().str.replace(' ','_')
data.columns = data.columns.str.lower().str.replace('/','_')

In [4]:
data = data.rename(columns={'bp':'best_position',
                            'team_&_contract':'team_contract', 
                            'w_f':'weak_foot', 'sm': 'skill_moves', 'a_w': 'attacking_work_rate', 'd_w': 'defensive_work_rate', 'ir': 'international_reputation', 'wf': 'weak_foot' })

In [5]:
data.to_csv('files_for_project/csv_files/fifa_cleaned3.csv')

#### Clean wage (€, and K symbolising thousands)

In [6]:
data['wage'] = data['wage'].str.replace('€', '')
data['wage'] = data['wage'].str.replace('K', '000')

data['wage'].astype('int')

0         4000
1        23000
2        49000
3         4000
4         2000
         ...  
11696     3000
11697     4000
11698     2000
11699        0
11700     2000
Name: wage, Length: 11701, dtype: int64

#### Clean international_reputation (★)

In [7]:
icon = '★'

data['international_reputation'] = data['international_reputation'].str.replace(icon, '')
data['international_reputation'] = data['international_reputation'].str.replace(' ', '')

data['international_reputation'].astype('int')

0        1
1        2
2        2
3        1
4        1
        ..
11696    1
11697    1
11698    1
11699    3
11700    1
Name: international_reputation, Length: 11701, dtype: int64

#### Divide the columns of each player position, and keep the numbers on the right as we already have the ones on the left

In [8]:
data_pos = data.loc[:, 'ls':'gk']

for col in data_pos.columns:
    data_pos[[col, col + '_right']] = data_pos[col].str.split('+', expand=True)

# Convertir las columnas resultantes en numéricas
data_pos = data_pos.astype(int)

data_pos = data_pos.drop(data_pos.loc[:, 'ls':'gk'].columns, axis=1)

#### Drop the columns of the sum of each category of attributes

In [9]:
data_att = data.loc[:, 'crossing':'gk_reflexes']
data_att = data_att.drop(['skill', 'movement', 'power', 'mentality', 'defending', 'goalkeeping'], axis=1)

#### Clean leftover NaNs with the means of each category

In [10]:
# data_att.isna().sum()

In [11]:
for col in data_att.columns:
    data_att[col] = data_att[col].fillna(data_att[col].mean())

# data_att.describe()

#### Concatenate all numerical dataframes

In [12]:
data_num = pd.concat([data_att, data_pos, data['international_reputation'], data['wage']], axis=1)
data_num.columns.tolist()

['crossing',
 'finishing',
 'heading_accuracy',
 'short_passing',
 'volleys',
 'dribbling',
 'curve',
 'fk_accuracy',
 'long_passing',
 'ball_control',
 'acceleration',
 'sprint_speed',
 'agility',
 'reactions',
 'balance',
 'shot_power',
 'jumping',
 'stamina',
 'strength',
 'long_shots',
 'aggression',
 'interceptions',
 'positioning',
 'vision',
 'penalties',
 'composure',
 'marking',
 'standing_tackle',
 'sliding_tackle',
 'gk_diving',
 'gk_handling',
 'gk_kicking',
 'gk_positioning',
 'gk_reflexes',
 'ls_right',
 'st_right',
 'rs_right',
 'lw_right',
 'lf_right',
 'cf_right',
 'rf_right',
 'rw_right',
 'lam_right',
 'cam_right',
 'ram_right',
 'lm_right',
 'lcm_right',
 'cm_right',
 'rcm_right',
 'rm_right',
 'lwb_right',
 'ldm_right',
 'cdm_right',
 'rdm_right',
 'rwb_right',
 'lb_right',
 'lcb_right',
 'cb_right',
 'rcb_right',
 'rb_right',
 'gk_right',
 'international_reputation',
 'wage']

## Processing data

#### Change the chosen categorical column (best_position) with OneHotEncoder to numerical

In [13]:
data_cat = pd.DataFrame(data['best_position'])

encoder = OneHotEncoder(drop='first').fit(data_cat)
encoded = encoder.transform(data_cat).toarray()

data_cat = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

#### Check Correlation between numerical dataframes and target to evaluate which columns might need dropping

In [14]:
data1_target = pd.concat([data['ova'], data_num], axis=1)
data1_target
correlation_matrix = data1_target.corr()
correlation_matrix

  correlation_matrix = data1_target.corr()


Unnamed: 0,ova,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,fk_accuracy,long_passing,...,ldm_right,cdm_right,rdm_right,rwb_right,lb_right,lcb_right,cb_right,rcb_right,rb_right,gk_right
ova,1.000000,0.390354,0.306632,0.304702,0.493152,0.360994,0.350262,0.399692,0.372304,0.479454,...,-0.008443,-0.008443,-0.008443,0.025408,0.063187,-0.046561,-0.046561,-0.046561,0.063187,0.040295
crossing,0.390354,1.000000,0.645621,0.435570,0.800162,0.674937,0.854544,0.831791,0.751382,0.740585,...,-0.085794,-0.085794,-0.085794,-0.105374,-0.047227,-0.001007,-0.001007,-0.001007,-0.047227,0.143883
finishing,0.306632,0.645621,1.000000,0.455388,0.650934,0.888847,0.820629,0.760857,0.695429,0.485792,...,-0.037664,-0.037664,-0.037664,-0.019330,-0.005118,0.088715,0.088715,0.088715,-0.005118,0.128136
heading_accuracy,0.304702,0.435570,0.455388,1.000000,0.630159,0.489435,0.531864,0.413944,0.366814,0.478891,...,-0.072665,-0.072665,-0.072665,-0.026832,-0.025207,-0.217723,-0.217723,-0.217723,-0.025207,0.157763
short_passing,0.493152,0.800162,0.650934,0.630159,1.000000,0.683227,0.839028,0.764904,0.719229,0.886005,...,-0.106755,-0.106755,-0.106755,-0.049393,-0.008647,-0.064731,-0.064731,-0.064731,-0.008647,0.180633
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
lcb_right,-0.046561,-0.001007,0.088715,-0.217723,-0.064731,0.045441,0.055077,0.033770,-0.010705,-0.105413,...,0.647764,0.647764,0.647764,0.509886,0.572440,1.000000,1.000000,1.000000,0.572440,0.451816
cb_right,-0.046561,-0.001007,0.088715,-0.217723,-0.064731,0.045441,0.055077,0.033770,-0.010705,-0.105413,...,0.647764,0.647764,0.647764,0.509886,0.572440,1.000000,1.000000,1.000000,0.572440,0.451816
rcb_right,-0.046561,-0.001007,0.088715,-0.217723,-0.064731,0.045441,0.055077,0.033770,-0.010705,-0.105413,...,0.647764,0.647764,0.647764,0.509886,0.572440,1.000000,1.000000,1.000000,0.572440,0.451816
rb_right,0.063187,-0.047227,-0.005118,-0.025207,-0.008647,-0.025835,-0.005715,-0.028331,-0.054342,-0.024910,...,0.736960,0.736960,0.736960,0.921632,1.000000,0.572440,0.572440,0.572440,1.000000,0.762873


#### Normalize attributes dataframe

In [15]:
tra = MinMaxScaler().fit(data_num)
nor = tra.transform(data_num)

data_nor = pd.DataFrame(nor, columns = data_num.columns)

data_nor

Unnamed: 0,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,fk_accuracy,long_passing,ball_control,...,rdm_right,rwb_right,lb_right,lcb_right,cb_right,rcb_right,rb_right,gk_right,international_reputation,wage
0,0.545455,0.478261,0.431818,0.720930,0.465116,0.615385,0.444444,0.561798,0.642857,0.637363,...,0.750,0.666667,0.666667,0.750,0.750,0.750,0.666667,0.333333,0.00,0.007143
1,0.681818,0.826087,0.806818,0.697674,0.837209,0.857143,0.822222,0.752809,0.642857,0.813187,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.25,0.041071
2,0.761364,0.793478,0.329545,0.813953,0.825581,0.879121,0.944444,0.966292,0.773810,0.879121,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.25,0.087500
3,0.431818,0.423913,0.602273,0.627907,0.372093,0.538462,0.411111,0.460674,0.571429,0.615385,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.007143
4,0.488636,0.369565,0.636364,0.697674,0.348837,0.648352,0.444444,0.449438,0.619048,0.670330,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.003571
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11696,0.636364,0.228261,0.522727,0.593023,0.279070,0.571429,0.500000,0.269663,0.535714,0.593407,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.005357
11697,0.068182,0.119565,0.090909,0.244186,0.139535,0.131868,0.122222,0.157303,0.214286,0.142857,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.007143
11698,0.659091,0.684783,0.522727,0.651163,0.430233,0.725275,0.588889,0.370787,0.583333,0.681319,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.003571
11699,0.659091,0.684783,0.522727,0.755814,0.477527,0.758242,0.506865,0.651685,0.750000,0.769231,...,0.625,0.500000,0.500000,0.625,0.625,0.625,0.500000,0.000000,0.50,0.000000


#### Concatenate both encoded and normalized data

In [16]:
data1 = pd.concat([data_cat, data_nor], axis=1)
# data1.describe().T
# data1.dtypes

#### Train-Test Split

In [17]:
X = pd.concat([data1], axis=1)
y = data['ova']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state = 42)

# Linear Regression

#### Train and apply model

In [18]:
lm = linear_model.LinearRegression()
lm.fit(X_train,y_train)

In [19]:
predictions = lm.predict(X_train)
r2_score(y_train, predictions)

0.8993499914468062

## Model Validation

#### R2

In [20]:
predictions_test = lm.predict(X_test) # The R2 test should be similar to the train test. That means is a strong model.
r2_score(y_test, predictions_test)

0.8931403562599238

#### MAE

In [21]:
mae = mean_absolute_error(y_test, predictions_test) # It is under 2, so it´s good. We can also compare this with the range of the target. It should be similar to the RMSE. But the RMSE is going to be bigger because of the squaring.
mae

1.6948013216810773

#### MSE

In [22]:
mse=mean_squared_error(y_test,predictions_test)
mse

4.842268547998138

#### RMSE

In [23]:
rmse = np.sqrt(mean_squared_error(y_test,predictions_test)) # usually bigger than MAE. It is affected by outliers.
rmse

2.200515518690595

***

## Clean and Preprocess functions for upcoming data to validate

In [34]:
def prep_data (df):

    # Rename columns
    df.columns = df.columns.str.lower().str.replace(' ','_')
    df.columns = df.columns.str.lower().str.replace('/','_')
    df = df.rename(columns={'bp':'best_position',
                                'team_&_contract':'team_contract', 
                                'w_f':'weak_foot', 'sm': 'skill_moves', 'a_w': 'attacking_work_rate', 'd_w': 'defensive_work_rate', 'ir': 'international_reputation', 'wf': 'weak_foot' })
    # Edit wage   
    df['wage'] = df['wage'].str.replace('€', '')
    df['wage'] = df['wage'].str.replace('K', '000')
    df['wage'].astype('int')

    # Edit positions
    data_pos = df.loc[:, 'ls':'gk']
    for col in data_pos.columns:
        data_pos[[col, col + '_right']] = data_pos[col].str.split('+', expand=True)
    data_pos = data_pos.astype(int)
    data_pos = data_pos.drop(data_pos.loc[:, 'ls':'gk'].columns, axis=1)
    
    # Edit attributes
    data_att = df.loc[:, 'crossing':'gk_reflexes']
    data_att = data_att.drop(['skill', 'movement', 'power', 'mentality', 'defending', 'goalkeeping'], axis=1)
    
    # Edit international reputation
    icon = '★'
    df['international_reputation'] = df['international_reputation'].str.replace(icon, '')
    df['international_reputation'] = df['international_reputation'].str.replace(' ', '')
    df['international_reputation'].astype('int')
    
    # Edit best_position
    data_cat = pd.DataFrame(df['best_position'])

    # Encode categorical data
    encoded = encoder.transform(data_cat).toarray()
    data_cat = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())
    
    # Fill nans
    for col in data_att.columns:
        data_att[col] = data_att[col].fillna(data_att[col].mean())
    
    # Concat 
    data_num = pd.concat([data_att, data_pos, df['international_reputation'], df['wage']], axis=1)
    
    
    # Normalise
    nor = tra.transform(data_num)
    data_nor = pd.DataFrame(nor, columns = data_num.columns)
    
    # concatenate final
    df = pd.concat([data_cat, data_nor], axis=1)
    
    return df

***

## Data validation

### We will import and prepare the dataset to validate with our trained model

#### Clean and prep the dataset

In [35]:
data2 = pd.read_csv('files_for_project/csv_files/fifa21_validate.csv')

In [36]:
data_validate_ready = prep_data(data2)
data_validate_ready

Unnamed: 0,best_position_CB,best_position_CDM,best_position_CF,best_position_CM,best_position_GK,best_position_LB,best_position_LM,best_position_LW,best_position_LWB,best_position_RB,...,rdm_right,rwb_right,lb_right,lcb_right,cb_right,rcb_right,rb_right,gk_right,international_reputation,wage
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.008929
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.005357
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.000893
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.000893
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.023214
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1994,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.001786
1995,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.00,0.001250
1996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.875,0.833333,0.833333,0.875,0.875,0.875,0.833333,0.666667,0.25,0.016071
1997,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.750,0.833333,0.833333,0.625,0.625,0.625,0.833333,0.666667,0.00,0.007143


### Validate given data with our model

#### R2 

In [37]:
X_validate = data_validate_ready
y_validate = data_validate['ova']

predictions_test_validate = lm.predict(X_validate)
r2_score(y_validate, predictions_test_validate)

0.8975604484355315

#### MAE

In [38]:
mae = mean_absolute_error(y_validate, predictions_test_validate)
mae

1.6766368196764279

#### MSE

In [39]:
mse = mean_squared_error(y_validate, predictions_test_validate)
mse

4.687785596732207

#### RMSE

In [40]:
rmse = np.sqrt(mse)
rmse

2.1651294641965886

## Conclusions

- We could see that these validation scores help us see that there is some dependancy on the stats, giving 1 or 2 numbers off in the mark of difference.
- To improve our model, we compared our initial with a reduced correlation (dropping any corr under 0.2) and one without this condition plus adding the wage, as we believe their salary could influence this value aside from the formula ones identified.
- We could see a slight improvement in the model with a +0.01 change in the R2 score, and one of -0.06 on the RSME score.
- We could see that these validation scores help us see that there is some dependancy on the stats, with some slight influence regarding wage.