<h1>EXPLANATION</h1>

At first, with only the training data available, we trained on all the numerical columns and obtained an R2 of 0.90. When the data to predict arrived, some of the columns used by that model were missing from it, and with little time left we decided to keep only two columns (the ones most correlated with the target). Did we lose information? Yes, but the two-column model still reaches an R2 of 0.71, which is an acceptable result. The predictions compared to the actual results have an R2 of 0.708, almost identical to the test-split value, so the model performs on unseen data about as well as it did in testing. Using more columns should give a better result, so that is the natural next step. We also trained an experimental model on goalkeepers only, which obtained an R2 of 0.96, so this approach would be worth exploring further.
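
The choice of "the most correlated" columns can be sketched as follows. This is a minimal illustration on a small synthetic table, not the selection code actually used for this notebook; the helper name `top_correlated_columns` and all toy values are assumptions.

```python
import numpy as np
import pandas as pd


def top_correlated_columns(dataframe, target, k=2):
    """Return the k numeric columns with the highest absolute Pearson
    correlation with the target column (hypothetical helper)."""
    numeric = dataframe.select_dtypes(include=np.number)
    correlations = numeric.corr()[target].drop(target).abs()
    return correlations.sort_values(ascending=False).head(k).index.tolist()


# Synthetic stand-in for the training data (values are made up).
rng = np.random.default_rng(0)
ova = rng.normal(65, 7, size=200)
toy = pd.DataFrame({
    'ova': ova,
    'reactions': ova + rng.normal(0, 2, size=200),   # strongly related
    'composure': ova + rng.normal(0, 4, size=200),   # moderately related
    'height': rng.normal(180, 6, size=200),          # unrelated noise
})

selected = top_correlated_columns(toy, 'ova', k=2)
print(selected)
```

Ranking by absolute correlation keeps strongly negative predictors in play as well, although it says nothing about redundancy between the selected columns.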

<h2>IMPORTING THE REQUIRED LIBRARIES</h2>

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from math import sqrt
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

import pickle

<h2>IMPORTING THE DATA INTO A DATAFRAME</h2>

In [2]:
def import_data(path):

    dataframe = pd.read_csv(path).drop('Unnamed: 0', axis = 1)

    return dataframe

fifa_21_data = import_data('fifa21_training.csv')

# Read the new data once and derive both the features and the player names from it
fifa_new_raw = import_data('fifa_new_data.csv')
fifa_new_data = fifa_new_raw[['mentality_composure', 'movement_reactions']].copy()
names_new_data = fifa_new_raw['long_name']

<h2>NORMALIZING THE COLUMNS OF THE DATAFRAME</h2>

In [3]:
def normalize_columns(dataframe):

    # Lower-case every header and replace spaces with underscores
    dataframe.columns = [column.lower().replace(' ', '_') for column in dataframe.columns]

    return dataframe

fifa_21_data = normalize_columns(fifa_21_data)
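
The column problem described in the explanation (columns used for training that are absent from the data to predict) can be caught right after loading. The sketch below is an assumption about how such a check might look; `missing_columns` is a hypothetical helper and the two toy frames only mimic the naming mismatch between the files.

```python
import pandas as pd


def missing_columns(training, new_data, ignore=()):
    """List training columns that do not appear in the new data
    (hypothetical helper; `ignore` holds columns such as the target)."""
    expected = set(training.columns) - set(ignore)
    return sorted(expected - set(new_data.columns))


# Toy frames that mimic the naming mismatch (values are made up).
training = pd.DataFrame({'composure': [70], 'reactions': [68], 'ova': [72]})
new_data = pd.DataFrame({'mentality_composure': [65], 'movement_reactions': [60]})

print(missing_columns(training, new_data, ignore=['ova']))
```

An empty list would mean every training feature is available in the new data under the same name.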

<h2>SEPARATING THE GOALKEEPERS FROM THE STANDARD PLAYERS</h2>

In [4]:
def separate_goalkeepers(dataframe, column, value):

    dataframe_goalkeepers = dataframe[dataframe[column] == value]

    return dataframe_goalkeepers

fifa_21_data_gk = separate_goalkeepers(fifa_21_data, 'position', 'GK')

<h2>GETTING ONLY THE INTERNATIONAL REPUTATION</h2>

In [5]:
def international_reputation(dataframe, column):
    
    column_data = dataframe[column]

    return column_data

# internation_reputation_data = international_reputation(fifa_21_data, ['ir'])

<h2>SELECTING ONLY THE NUMERICAL DATA</h2>

In [6]:
def get_numerical_data(dataframe, to_avoid_columns):

    numerical_data = dataframe.select_dtypes(include = np.number).drop(to_avoid_columns, axis = 1)

    return numerical_data

fifa_21_data = get_numerical_data(fifa_21_data, ['id'])
fifa_21_data_gk_numerical = get_numerical_data(fifa_21_data_gk, ['id'])

<h2>SELECTING THE MODEL COLUMNS FOR ALL PLAYERS AND FOR THE GOALKEEPERS DATA</h2>

In [7]:
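def select_columns(dataframe, columns):

    # Copy the selection so that later in-place edits (e.g. filling NaNs)
    # do not act on a view of the original dataframe
    dataframe = dataframe[columns].copy()

    return dataframe

fifa_21_data = select_columns(fifa_21_data, ['composure', 'reactions', 'ova'])
# Rename the features to match the column names used in the new data
fifa_21_data.columns = ['mentality_composure', 'movement_reactions', 'ova']

fifa_21_data_gk_numerical = select_columns(fifa_21_data_gk_numerical, ['goalkeeping', 'base_stats', 'ova'])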

<h2>FILLING NAN VALUES WITH THE MEAN</h2>

In [8]:
def fill_nan_values(dataframe):

    dataframe_columns = dataframe.columns

    for column in dataframe_columns:
        if dataframe[column].isna().sum() > 0:
            dataframe[column] = dataframe[column].fillna(dataframe[column].mean())

    return dataframe

fifa_21_data = fill_nan_values(fifa_21_data)
fifa_new_data = fill_nan_values(fifa_new_data)
fifa_21_data_gk_numerical = fill_nan_values(fifa_21_data_gk_numerical)
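
`fill_nan_values` fills each dataframe with its own column means, so the new data is filled with values computed from the new data itself. One alternative, sketched here under the assumption that the training means are the better reference, is to reuse the training means for the data to predict. `fill_with_reference_means` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd


def fill_with_reference_means(dataframe, reference):
    """Fill NaNs in `dataframe` with the column means of `reference`
    (hypothetical helper; returns a new frame, inputs are untouched)."""
    return dataframe.fillna(reference.mean(numeric_only=True))


train = pd.DataFrame({'mentality_composure': [60.0, 70.0, 80.0],
                      'movement_reactions': [50.0, np.nan, 70.0]})
new = pd.DataFrame({'mentality_composure': [np.nan, 66.0],
                    'movement_reactions': [61.0, np.nan]})

new_filled = fill_with_reference_means(new, train)
print(new_filled)
```

With this approach a missing value always maps to the same number, no matter which batch of new players it arrives in.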

<h2>ENCODING THE INTERNATIONAL REPUTATION</h2>

In [9]:
def international_reputation_encode(dataframe):

    dataframe = dataframe['ir'].replace({'1 ★': 0, '2 ★': 1, '3 ★': 2, '4 ★': 3, '5 ★': 4})
    
    return dataframe

# internation_reputation_data = international_reputation_encode(internation_reputation_data)
# fifa_21_data['ir'] = internation_reputation_data

<h2>SPLITTING THE DATA INTO TWO DATA STRUCTURES: X AND Y</h2>

In [10]:
def xy_split(dataframe, target):

    X = dataframe.drop(target, axis = 1)
    y = dataframe[target]

    return X, y

X, y = xy_split(fifa_21_data, 'ova')
X_gk, y_gk = xy_split(fifa_21_data_gk_numerical, 'ova')

<h2>SPLITTING THE DATA INTO THE TRAINING AND TESTING STRUCTURES</h2>

In [11]:
def training_testing_split(x_dataframe, y_array):

    X_train, X_test, y_train, y_test = train_test_split(x_dataframe, y_array, test_size = 0.10, random_state = 42)

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = training_testing_split(X, y)
X_train_gk, X_test_gk, y_train_gk, y_test_gk = training_testing_split(X_gk, y_gk)

<h2>SCALING THE DATA USING THE STANDARDSCALER CLASS FROM SKLEARN</h2>

In [12]:
def scale_data(x_training, x_testing):

    standard_scaler = StandardScaler().fit(x_training)

    training_array_scaled = standard_scaler.transform(x_training)
    testing_array_scaled = standard_scaler.transform(x_testing)

    x_training_scaled = pd.DataFrame(data = training_array_scaled, columns = x_training.columns)
    x_testing_scaled = pd.DataFrame(data = testing_array_scaled, columns = x_testing.columns)

    return x_training_scaled, x_testing_scaled, standard_scaler

X_train_scaled, X_test_scaled, scaler = scale_data(X_train, X_test)

fifa_new_data_array = scaler.transform(fifa_new_data)
fifa_new_data_scaled = pd.DataFrame(data = fifa_new_data_array, columns = fifa_new_data.columns)

X_train_scaled_gk, X_test_scaled_gk, scaler_gk = scale_data(X_train_gk, X_test_gk)

<h2>CREATING AND FITTING A REGRESSION MODEL</h2>

In [13]:
def create_fit_model(x_training, y_training, selected_model):

    model = selected_model
    model.fit(x_training, y_training)

    return model

linear_regression = create_fit_model(X_train_scaled, y_train, LinearRegression())
linear_regression_gk = create_fit_model(X_train_scaled_gk, y_train_gk, LinearRegression())
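
The scale-then-fit steps above can also be bundled into a single scikit-learn pipeline, which removes the need to remember to transform every new dataframe by hand. The sketch below uses synthetic data (the coefficients and noise level are made up) and checks that the pipeline gives the same predictions as the manual route used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two feature columns and the target.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    'mentality_composure': rng.normal(60, 10, size=300),
    'movement_reactions': rng.normal(62, 9, size=300),
})
y_toy = (0.3 * X_toy['mentality_composure']
         + 0.5 * X_toy['movement_reactions']
         + rng.normal(0, 2, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.10, random_state=42)

# The pipeline fits the scaler on the training split only and reuses it
# automatically whenever predict() or score() is called.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)

# For comparison: the manual fit/transform route used in this notebook.
scaler_toy = StandardScaler().fit(X_tr)
manual = LinearRegression().fit(scaler_toy.transform(X_tr), y_tr)

same = np.allclose(pipeline.predict(X_te), manual.predict(scaler_toy.transform(X_te)))
print(same)
```

A pipeline is also a single object to export, so the scaler cannot be forgotten when the model is reused elsewhere.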

<h2>GENERATING THE METRICS OF THE MODEL</h2>

In [14]:
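def model_metrics(model, testing_x, testing_y):

    predictions = model.predict(testing_x)

    # model.score returns R2 computed as r2_score(testing_y, predictions)
    score = model.score(testing_x, testing_y)
    # NOTE: scikit-learn documents these metrics as metric(y_true, y_pred).
    # MSE and MAE are symmetric, but R2 is not, so with the arguments in this
    # order `r2` can differ from `score` (0.71 vs 0.79 in the results below).
    r2 = r2_score(predictions, testing_y)
    mse = mean_squared_error(predictions, testing_y)
    mae = mean_absolute_error(predictions, testing_y)
    rmse = sqrt(mse)

    return score, r2, mse, mae, rmse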

score, r2, mse, mae, rmse = model_metrics(linear_regression, X_test_scaled, y_test)
score_gk, r2_gk, mse_gk, mae_gk, rmse_gk = model_metrics(linear_regression_gk, X_test_scaled_gk, y_test_gk)
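
scikit-learn documents `r2_score` as `r2_score(y_true, y_pred)`, while `model_metrics` passes the predictions first. R2 is not symmetric in its arguments, which likely explains why SCORE and R2 SCORE differ for the all-players model. The sketch below illustrates the effect on synthetic data; the numbers it prints are not the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic one-feature regression problem (values are made up).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 1))
y_true = 3.0 * X_toy[:, 0] + rng.normal(0, 2.0, size=500)

model = LinearRegression().fit(X_toy, y_true)
y_pred = model.predict(X_toy)

r2_documented = r2_score(y_true, y_pred)   # documented order: (y_true, y_pred)
r2_swapped = r2_score(y_pred, y_true)      # order used in model_metrics above

print(round(model.score(X_toy, y_true), 3), round(r2_documented, 3), round(r2_swapped, 3))

# MSE is symmetric, so the argument order does not matter there.
mse_is_symmetric = np.isclose(mean_squared_error(y_true, y_pred),
                              mean_squared_error(y_pred, y_true))
print(mse_is_symmetric)
```

With the documented order the value matches `model.score`. The swapped order divides by the variance of the predictions instead of the variance of the true values, which here gives a lower number.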

<h2>MODEL RESULTS</h2>

In [15]:
print('\nRESULTS AND METRICS OF THE MODEL - ALL TYPES OF PLAYERS')
print('-------------------------------------------------------\n')

print(f'SCORE: {round(score, 2)}')
print(f'R2 SCORE: {round(r2, 2)}')
print(f'MEAN SQUARED ERROR: {round(mse, 2)}')
print(f'MEAN ABSOLUTE ERROR: {round(mae, 2)}')
print(f'ROOT MEAN SQUARED ERROR: {round(rmse, 2)}')


RESULTS AND METRICS OF THE MODEL - ALL TYPES OF PLAYERS
-------------------------------------------------------

SCORE: 0.79
R2 SCORE: 0.71
MEAN SQUARED ERROR: 11.01
MEAN ABSOLUTE ERROR: 2.59
ROOT MEAN SQUARED ERROR: 3.32


<h2>MODEL RESULTS - ONLY GOALKEEPERS - EXPERIMENTAL APPROACH</h2>

In [16]:
print('\nRESULTS AND METRICS OF THE MODEL - ONLY GOALKEEPERS')
print('---------------------------------------------------\n')

print(f'SCORE: {round(score_gk, 2)}')
print(f'R2 SCORE: {round(r2_gk, 2)}')
print(f'MEAN SQUARED ERROR: {round(mse_gk, 2)}')
print(f'MEAN ABSOLUTE ERROR: {round(mae_gk, 2)}')
print(f'ROOT MEAN SQUARED ERROR: {round(rmse_gk, 2)}')


RESULTS AND METRICS OF THE MODEL - ONLY GOALKEEPERS
---------------------------------------------------

SCORE: 0.96
R2 SCORE: 0.96
MEAN SQUARED ERROR: 2.15
MEAN ABSOLUTE ERROR: 0.78
ROOT MEAN SQUARED ERROR: 1.47

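
The goalkeeper model is evaluated on a single 10% test split of an already reduced subset, so its R2 may depend noticeably on which rows land in that split. A cross-validated estimate is one way to check this. The sketch below runs on synthetic data (all values are made up) and assumes a 5-fold split, which is a choice for illustration rather than something used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small goalkeeper subset (values are made up).
rng = np.random.default_rng(1)
X_gk_toy = rng.normal(size=(150, 2))
y_gk_toy = 65 + 4 * X_gk_toy[:, 0] + 2 * X_gk_toy[:, 1] + rng.normal(0, 1.5, size=150)

# Scaling lives inside the pipeline so it is refitted on each training fold.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# One R2 value per fold; the spread shows how much a single split can vary.
fold_scores = cross_val_score(pipeline, X_gk_toy, y_gk_toy, cv=folds, scoring='r2')
print(fold_scores.round(3), fold_scores.mean().round(3))
```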

<h2>PREDICTIONS FOR THE NEW DATA</h2>

In [17]:
predictions = linear_regression.predict(fifa_new_data_scaled)
final_predictions = pd.DataFrame(data = zip(names_new_data.tolist(), predictions), columns = ['name', 'ova_predictions'])
final_predictions.sort_values(by = 'ova_predictions', ascending = False)

     name                         ova_predictions
511  Toby Alderweireld            83.364757
758  Jan Oblak                    81.986404
288  Paulo Bruno Exequiel Dybala  81.456102
761  Jadon Sancho                 81.024409
154  Rodrigo Hernández Cascante   80.435488
...  ...                          ...
217  Ted Tattermusch              50.751094
706  Marius Bildøy                50.402672
929  Paul Martin                  48.410746
390  Callum King-Harmes           47.556674


<h2>EXPORT THE MODEL FOR OTHER USES</h2>

In [18]:
filename = 'model.sav'

# Use a context manager so the file is closed even if pickling fails
with open(filename, 'wb') as file:
    pickle.dump(linear_regression, file)
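
The exported model expects inputs that were transformed with the fitted scaler, so anyone loading `model.sav` alone would also need that scaler. One option, sketched below as an assumption rather than what this notebook does, is to pickle both objects together. The file name `model_bundle.sav`, the dictionary keys, and the toy data are all hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy model and scaler standing in for linear_regression and scaler.
rng = np.random.default_rng(0)
X_toy = rng.normal(60, 10, size=(100, 2))
y_toy = X_toy @ np.array([0.3, 0.5]) + rng.normal(0, 1, size=100)

scaler_toy = StandardScaler().fit(X_toy)
model_toy = LinearRegression().fit(scaler_toy.transform(X_toy), y_toy)

# Store both objects in one file; the dictionary keys are arbitrary names.
path = os.path.join(tempfile.mkdtemp(), 'model_bundle.sav')
with open(path, 'wb') as file:
    pickle.dump({'scaler': scaler_toy, 'model': model_toy}, file)

# Later, possibly in another script: load and predict on raw values.
with open(path, 'rb') as file:
    bundle = pickle.load(file)

raw_rows = np.array([[70.0, 75.0], [50.0, 55.0]])
restored = bundle['model'].predict(bundle['scaler'].transform(raw_rows))
print(restored)
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the scikit-learn version that produced them.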

<h2>REAL RESULTS</h2>

In [19]:
results = [52, 62, 57, 59, 66, 65, 62, 73, 70, 62, 58, 72, 52, 68, 69, 67, 71, 61, 72, 70, 73, 64, 71, 56, 76, 56, 68, 63, 66, 67, 62, 70, 64, 62, 66, 70, 70, 66, 66, 70, 57, 55, 64, 64, 51, 58, 62, 60, 68, 67, 68, 71, 66, 56, 65, 60, 71, 72, 67, 75, 63, 67, 78, 66, 61, 69, 73, 64, 64, 67, 78, 64, 60, 58, 69, 68, 68, 82, 71, 54, 68, 70, 56, 69, 64, 62, 70, 69, 65, 79, 63, 61, 62, 63, 60, 63, 66, 71, 64, 68, 74, 71, 62, 65, 74, 57, 66, 71, 64, 64, 65, 65, 49, 75, 62, 67, 63, 75, 61, 69, 61, 72, 65, 60, 65, 61, 66, 67, 65, 70, 65, 64, 76, 76, 62, 67, 62, 70, 70, 60, 60, 68, 54, 71, 69, 72, 64, 66, 64, 67, 59, 59, 71, 56, 85, 79, 72, 65, 57, 73, 76, 67, 53, 66, 65, 67, 72, 75, 63, 70, 58, 66, 69, 88, 67, 62, 63, 72, 68, 69, 62, 64, 54, 75, 61, 58, 67, 77, 59, 49, 64, 68, 70, 72, 55, 62, 70, 77, 65, 81, 55, 61, 59, 70, 59, 64, 58, 64, 69, 63, 56, 58, 64, 67, 50, 67, 64, 51, 68, 56, 54, 64, 67, 60, 70, 78, 66, 72, 72, 66, 70, 76, 65, 62, 73, 68, 61, 72, 58, 56, 71, 60, 73, 65, 65, 74, 73, 67, 69, 65, 73, 69, 71, 74, 66, 74, 75, 79, 72, 69, 64, 66, 65, 72, 63, 65, 67, 69, 78, 51, 65, 75, 60, 67, 69, 64, 57, 63, 76, 64, 72, 59, 73, 65, 70, 80, 72, 59, 88, 61, 69, 61, 65, 69, 66, 70, 66, 60, 67, 68, 75, 62, 52, 69, 65, 81, 86, 72, 72, 66, 52, 56, 78, 62, 71, 78, 58, 60, 68, 64, 73, 53, 75, 68, 60, 77, 66, 73, 63, 71, 67, 65, 80, 77, 70, 66, 69, 65, 53, 52, 74, 65, 65, 67, 69, 70, 60, 59, 69, 68, 54, 82, 68, 75, 69, 71, 70, 79, 67, 58, 79, 64, 57, 68, 68, 66, 66, 73, 64, 81, 60, 69, 52, 59, 57, 68, 67, 55, 72, 76, 75, 64, 74, 65, 67, 59, 65, 66, 72, 73, 51, 67, 64, 62, 66, 68, 52, 56, 64, 70, 78, 71, 59, 69, 61, 62, 66, 64, 66, 67, 61, 72, 66, 64, 70, 70, 54, 74, 68, 64, 69, 65, 62, 76, 63, 66, 61, 72, 69, 76, 65, 76, 61, 52, 80, 67, 63, 60, 68, 66, 67, 59, 67, 72, 60, 51, 62, 81, 71, 69, 56, 67, 68, 69, 63, 65, 69, 62, 65, 71, 67, 66, 61, 73, 61, 51, 62, 61, 75, 65, 76, 68, 69, 65, 62, 64, 64, 73, 75, 71, 69, 58, 67, 60, 52, 65, 58, 77, 50, 80, 70, 68, 66, 69, 69, 60, 69, 
61, 68, 80, 77, 67, 65, 74, 66, 65, 68, 78, 62, 80, 60, 87, 83, 68, 54, 67, 68, 60, 67, 61, 64, 59, 66, 73, 72, 58, 73, 72, 60, 53, 71, 65, 74, 73, 56, 75, 71, 64, 64, 66, 64, 70, 73, 78, 75, 53, 79, 73, 63, 67, 56, 73, 62, 54, 67, 63, 71, 69, 74, 74, 76, 68, 68, 64, 58, 62, 60, 63, 68, 68, 66, 75, 54, 70, 74, 62, 60, 67, 74, 73, 74, 55, 79, 67, 60, 68, 64, 50, 75, 63, 72, 57, 65, 66, 71, 59, 63, 57, 55, 68, 57, 67, 73, 52, 66, 68, 67, 56, 70, 69, 66, 63, 73, 65, 51, 61, 61, 78, 68, 65, 55, 64, 60, 62, 66, 67, 53, 67, 72, 64, 52, 65, 59, 70, 65, 79, 67, 75, 65, 61, 77, 63, 53, 61, 71, 69, 78, 48, 69, 63, 53, 67, 64, 76, 76, 60, 63, 66, 62, 67, 66, 67, 69, 68, 78, 62, 74, 72, 72, 65, 63, 59, 71, 68, 67, 70, 71, 65, 62, 58, 84, 68, 63, 62, 71, 68, 62, 78, 82, 67, 72, 79, 68, 69, 68, 60, 61, 76, 69, 72, 66, 68, 77, 62, 57, 66, 53, 62, 63, 63, 59, 74, 70, 72, 61, 66, 82, 69, 58, 70, 65, 69, 72, 67, 66, 76, 70, 71, 71, 66, 77, 59, 74, 68, 68, 83, 67, 63, 63, 69, 64, 67, 63, 64, 62, 61, 48, 72, 53, 49, 59, 77, 74, 67, 66, 60, 74, 58, 91, 71, 60, 84, 73, 68, 71, 66, 54, 65, 64, 64, 63, 69, 75, 71, 68, 62, 70, 65, 79, 57, 65, 65, 70, 66, 61, 61, 74, 59, 54, 59, 64, 75, 50, 66, 63, 69, 70, 66, 63, 57, 66, 76, 54, 73, 68, 62, 64, 62, 64, 67, 70, 75, 68, 57, 70, 64, 63, 51, 74, 80, 72, 65, 72, 53, 65, 77, 72, 63, 69, 57, 69, 65, 65, 67, 70, 79, 69, 55, 73, 66, 61, 77, 68, 66, 70, 73, 71, 70, 72, 66, 66, 64, 60, 67, 63, 58, 51, 61, 71, 65, 80, 75, 74, 64, 69, 62, 73, 65, 66, 72, 64, 68, 62, 56, 78, 78, 70, 73, 52, 68, 61, 72, 61, 60, 67, 69, 79, 66, 65, 76, 66, 74, 63, 71, 66, 71, 66, 54, 70, 71, 65, 64, 68, 66, 71, 66, 79, 64, 78, 64, 54, 70, 59, 59, 76, 70, 78, 65, 61, 68, 71, 63, 65, 67, 71, 64, 65, 62, 77, 48, 75, 67, 75, 68, 64, 67, 62, 64, 59, 64, 73, 58, 85, 63, 65, 62, 69, 72, 75, 59, 71, 55, 70, 70, 64, 66, 61, 64, 54, 72, 54, 66, 62, 63, 76, 69, 69, 73, 67, 74, 71, 56, 71, 63, 67, 68, 65, 80, 60, 58, 65, 75, 66, 70, 63, 69, 61, 69, 72, 67, 65, 67, 70, 72, 68, 67, 73, 
72, 59, 68]
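# NOTE: as in model_metrics, the predictions are passed first here, whereas
# scikit-learn documents r2_score(y_true, y_pred); R2 is not symmetric.
r2_final = r2_score(predictions, results)
print(f'REAL R2 FROM NEW DATA: {r2_final}')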

REAL R2 FROM NEW DATA: 0.7080565873767162
