# Predicting performance direction of Premier League Footballers using 'Fantasy Football' data and Machine learning

Using an LSTM neural network trained on data collected from three seasons of Premier League football games, a predictive model of player performance was built. 'Direction of performance' simply means whether a player performed better or worse in the next match played relative to the previous week, i.e. a simple binary choice. Thus, a categorical model is used here.

The construction of this model is outlined below.

### Importing and normalizing data
The data used to train the model was collected from the official Premier League online game 'Fantasy Premier League'. This game tracks multiple performance attributes, such as the number of shots or the number of fouls committed by each player every week, and based on these variables gives each player a score every week. The score that each player gets depends primarily on the number of bonus points ('bps') they receive, which is determined by set rules: for example, 4 bps for scoring a goal, 2 bps for playing 90 minutes (the length of a Premier League football match), or -2 points for receiving a red card. These rules are not of direct consequence for the analysis, but the full list can be found here: https://fantasy.premierleague.com/help/rules.

The number of bps awarded to a given player additionally depends on how well that player performs relative to other players in the same match that week. Thus bps, and the total points awarded per week, are sufficiently random that predicting them is a problem well suited to machine learning, given the large number of variables which may affect them.

The first step of the analysis is to import the data, which is separated by season and, within each season, into individual .csv files for each player. Each file records various attribute values for each week (of which there are 38). We begin by importing the required modules and defining the classification function, which will be used to define the performance direction:

In [1]:
import pandas as pd
import numpy as np
import os
import datetime

def classify(current, future):
    if float(future) > float(current):
        return 1
    else:
        return 0
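
As a rough illustration of how this label behaves, the sketch below applies the same comparison to consecutive weeks of a made-up score list. The helper `direction_labels` and the scores are hypothetical and not part of the original notebook. Note that a tie is labelled 0, since the comparison is strict.

```python
def classify(current, future):
    # Same rule as above: 1 if the later score is strictly higher, else 0.
    return 1 if float(future) > float(current) else 0

def direction_labels(scores):
    """Hypothetical helper: label each week by comparing it with the following week."""
    return [classify(cur, nxt) for cur, nxt in zip(scores, scores[1:])]

weekly_scores = [9, 13, 2, 0, 0]  # made-up example values
print(direction_labels(weekly_scores))  # [1, 0, 0, 0]
```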

Next, the csv files need to be processed. Using the file 'raw_data/2016-17/players/Sergio_Agüero/gw.csv' as an example to show the format of these files, we have:

In [2]:
dat = pd.read_csv("raw_data/2016-17/players/Sergio_Agüero/gw.csv")
dat.columns

Index(['assists', 'attempted_passes', 'big_chances_created',
       'big_chances_missed', 'bonus', 'bps', 'clean_sheets',
       'clearances_blocks_interceptions', 'completed_passes', 'creativity',
       'dribbles', 'ea_index', 'element', 'errors_leading_to_goal',
       'errors_leading_to_goal_attempt', 'fixture', 'fouls', 'goals_conceded',
       'goals_scored', 'ict_index', 'id', 'influence', 'key_passes',
       'kickoff_time', 'kickoff_time_formatted', 'loaned_in', 'loaned_out',
       'minutes', 'offside', 'open_play_crosses', 'opponent_team', 'own_goals',
       'penalties_conceded', 'penalties_missed', 'penalties_saved',
       'recoveries', 'red_cards', 'round', 'saves', 'selected', 'tackled',
       'tackles', 'target_missed', 'team_a_score', 'team_h_score', 'threat',
       'total_points', 'transfers_balance', 'transfers_in', 'transfers_out',
       'value', 'was_home', 'winning_goals', 'yellow_cards'],
      dtype='object')

To determine which variables potentially correlate with 'bps' and with the target column 'future_points' that the model will predict, use the corr() function:

In [3]:
# Note: shift(+1) moves each value down one row, so row i receives the bps of row i - 1.
dat["future_points"] = dat['bps'].shift(+1)
dat.corr()[["bps","future_points"]]

| | bps | future_points |
|---|---|---|
| assists | 0.500451 | -0.194994 |
| attempted_passes | 0.586816 | 0.170655 |
| big_chances_created | 0.333012 | -0.004426 |
| big_chances_missed | 0.048586 | 0.197941 |
| bonus | 0.90967 | -0.015912 |
| bps | 1.0 | -0.013018 |
| clean_sheets | 0.28278 | 0.242436 |
| clearances_blocks_interceptions | 0.18806 | 0.089263 |
| completed_passes | 0.614451 | 0.130888 |
| creativity | 0.728404 | 0.156208 |
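
Because the sign passed to `shift` decides which week ends up next to which, a small sketch on made-up values may help when reading the `future_points` column. The series below is illustrative only; `previous_week` and `following_week` are names introduced here, not in the notebook.

```python
import pandas as pd

bps = pd.Series([33, 57, 6, 0], name="bps")  # made-up weekly values

previous_week = bps.shift(1)    # row i holds the value from row i - 1
following_week = bps.shift(-1)  # row i holds the value from row i + 1

print(pd.DataFrame({"bps": bps,
                    "shift(+1)": previous_week,
                    "shift(-1)": following_week}))
```

With `shift(1)` the first row has no earlier value and becomes NaN; with `shift(-1)` the last row does.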


Since the data set being used is large, and since players in different positions will likely depend on different variables (for example, Sergio Agüero is a forward, so his bps is potentially more dependent on 'creativity' and less dependent on 'clean_sheets' than a defender's would be), we will initially use only the most general variables that have the highest correlation with bps:

In [4]:
variables = ["assists", "attempted_passes", "big_chances_missed", "clean_sheets", "completed_passes",
             "creativity", "dribbles", "goals_conceded", "goals_scored", "ict_index", "key_passes",
            "transfers_in", "transfers_out", "value", "bps", "total_points", "future_points", "round", "minutes"]
dat = dat[variables]
dat.head()

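| | assists | attempted_passes | big_chances_missed | clean_sheets | completed_passes | creativity | dribbles | goals_conceded | goals_scored | ict_index | key_passes | transfers_in | transfers_out | value | bps | total_points | future_points | round | minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 33 | 0 | 0 | 26 | 29.6 | 1 | 1 | 1 | 14.3 | 2 | 0 | 0 | 130 | 33 | 9 | NaN | 1 | 90 |
| 1 | 0 | 21 | 0 | 0 | 17 | 13.3 | 2 | 1 | 2 | 16.7 | 1 | 75648 | 12138 | 130 | 57 | 13 | 33.0 | 2 | 82 |
| 2 | 0 | 27 | 0 | 0 | 21 | 17.5 | 4 | 1 | 0 | 5.2 | 1 | 213510 | 8885 | 131 | 6 | 2 | 57.0 | 3 | 87 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 32042 | 984971 | 130 | 0 | 0 | 6.0 | 4 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 11435 | 167808 | 130 | 0 | 0 | 0.0 | 5 | 0 |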


A general function is now required to loop through all the files and select the required columns as shown above. Whilst most files are 38 entries long, there are some anomalies in length which need to be accounted for, which is done below.

In particular, some of the .csv files may not be of length 38 because the player they correspond to did not play for a Premier League club for the whole season. In this case the rounds are re-labelled to keep the data format consistent, which is necessary because the rounds will later be used as an index. N.b. the fact that the round indices may not be correctly labelled (some players may play for the first half of the season and others the second, and this detail is lost in the relabelling performed by the function below) is not relevant, as the network will be trained to analyse short-term patterns, and thus the time of season is not a factor.

In [5]:
def Select_Columns():
    new_directory = f"filtered_data_{datetime.date.today()}"
    os.makedirs(new_directory, exist_ok=True)  # do not fail if the cell is re-run
    rounds_array = list(range(1, 39))  # game weeks 1..38

    # Fix incorrect indexing in data:
    years = [[16, 17], [17, 18], [18, 19]]
    for year in years:
        season = f"raw_data/20{year[0]}-{year[1]}"
        player_list = pd.read_csv(f"{season}/player_idlist.csv")
        for i in range(len(player_list)):
            row = player_list.iloc[i]
            name = f"{row.iloc[0]}_{row.iloc[1]}"
            # Try the 'First_Last' folder first, then fall back to 'First_Last_id'.
            try:
                player = pd.read_csv(f"{season}/players/{name}/gw.csv")
            except FileNotFoundError:
                player = pd.read_csv(f"{season}/players/{name}_{row.iloc[2]}/gw.csv")

            # Relabel the rounds so that every file ends at round 38;
            # a full-length file is labelled 1..38.
            difference_length = 38 - len(player["round"])
            player["round"] = rounds_array[difference_length:]
            player["future_points"] = player["bps"].shift(+1)
            player = player[variables]
            player.to_csv(f"{new_directory}/{name}_{year[0]}.csv", header=True, index=False)

    return new_directory

working_dir = Select_Columns()
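
To make the relabelling concrete, here is a minimal sketch of the slicing step on its own. The helper `relabel_rounds` is hypothetical (it does not appear in the notebook) and only mirrors the `rounds_array[difference_length:]` expression used above.

```python
def relabel_rounds(n_rows, season_length=38):
    """Hypothetical helper: give a file with n_rows entries the last n_rows round labels."""
    rounds = list(range(1, season_length + 1))
    return rounds[season_length - n_rows:]

print(relabel_rounds(3))        # [36, 37, 38]
print(relabel_rounds(38)[:3])   # [1, 2, 3]
```

A player with only three recorded weeks is therefore always labelled 36 to 38, regardless of when in the season those weeks were actually played.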

Next, normalize and then structure the data in preparation for passing it on to the LSTM:

In [6]:
def Normalise(passed_directory):
    """Add a 'performance_direction' label to every file, then scale each
    variable by its maximum value across all files."""
    csv_files = [f for f in os.listdir(passed_directory) if f.endswith(".csv")]

    # Add the new column, 'performance_direction', to every file.
    for filename in csv_files:
        df = pd.read_csv(f"{passed_directory}/{filename}")
        df["performance_direction"] = list(map(classify, df["future_points"], df["total_points"]))
        df.to_csv(f"{passed_directory}/{filename}", header=True, index=False)

    # Dict holding the maximum value of each variable, used to normalise the data.
    # 'round' is dropped as it is not a variable, and 'performance_direction' is added.
    # A new list is built so that the global 'variables' list is left unchanged.
    norm_variables = [v for v in variables if v != "round"] + ["performance_direction"]
    variables_max_values = dict.fromkeys(norm_variables, 0.0)

    # Loop through all files to determine the maximum values.
    for filename in csv_files:
        df = pd.read_csv(f"{passed_directory}/{filename}")
        if len(df.index) > 1:
            for var, value in variables_max_values.items():
                column_max = df[var].max()  # NaN entries are skipped
                if column_max > value:
                    variables_max_values[var] = column_max

    # Loop through a second time to normalise.
    new_directory = f"Normalised_{passed_directory}"
    os.makedirs(new_directory, exist_ok=True)
    for filename in csv_files:
        df = pd.read_csv(f"{passed_directory}/{filename}")
        for var, value in variables_max_values.items():
            df[var] = df[var].div(value)

        # Save the file for later only if it has more than 30 weeks with non-zero minutes.
        if (df["minutes"] > 0).sum() > 30:
            df.to_csv(f"{new_directory}/{filename}", header=True, index=False)

    return variables_max_values, new_directory
    
normlisation_factors, working_dir = Normalise(working_dir)
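
The scaling itself is a plain division by the largest observed value, and the returned dictionary is what later allows predictions to be converted back into points. A minimal sketch on made-up numbers follows; `max_bps` stands in for a single entry of the returned dictionary.

```python
import numpy as np

bps = np.array([33.0, 57.0, 6.0, 0.0])  # made-up values for one variable
max_bps = bps.max()                     # plays the role of one entry in the returned dict

scaled = bps / max_bps        # values now lie in [0, 1] for non-negative data
restored = scaled * max_bps   # multiplying by the same factor undoes the scaling

print(scaled.max())                  # 1.0
print(np.allclose(restored, bps))    # True
```

In the notebook the maximum is taken across all player files rather than within a single file, so most players never reach a scaled value of 1.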

Now that the data is normalised, it must be re-grouped in preparation for the LSTM. The length in the function below refers to the number of periods of data that will be fed into the model at once, i.e. Length = 3 corresponds to 3 weeks of match data. For temporal patterns in the data to be extracted, the order must be retained within each sequence, which is what the deque in the function below allows for.

In [7]:
from collections import deque
import random

def Sequence_data(passed_directory, length):
    SEQ_LEN = length
    sequential_data = []  # list of [sequence, label] pairs
    for filename in os.listdir(passed_directory):
        if filename.endswith(".csv"):
            df = pd.read_csv(f"{passed_directory}/{filename}", index_col="round")
            # The last column is the label; the columns before it are the input features.
            input_df = df[["transfers_in", "transfers_out", "bps", "ict_index", "minutes", "performance_direction"]]
            prev_days = deque(maxlen=SEQ_LEN)  # rolling window of the last SEQ_LEN weeks
            for i in input_df.values:
                prev_days.append([n for n in i[:-1]])
                if len(prev_days) == SEQ_LEN:
                    sequential_data.append([np.array(prev_days), i[-1]])
    random.shuffle(sequential_data)  # mix players and weeks before splitting
    return sequential_data


Length = 3
one= Sequence_data(working_dir, Length)
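
The rolling-window behaviour of `deque(maxlen=...)` can be seen in isolation with a few made-up rows. This is only a sketch of the idea; the rows and the single feature column are invented for illustration.

```python
from collections import deque

rows = [[0.1, 0], [0.2, 1], [0.3, 0], [0.4, 1]]  # made-up rows: [feature, label]
window = deque(maxlen=3)  # keeps only the 3 most recent feature rows
sequences = []

for *features, label in rows:
    window.append(features)  # the oldest row is dropped automatically once full
    if len(window) == window.maxlen:
        # pair the last 3 weeks of features with the label of the newest week
        sequences.append((list(window), label))

for seq in sequences:
    print(seq)
```

Four rows with a window of three give two overlapping sequences, each keeping its rows in their original order.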


In [8]:
x = []
Y = []
for features, result in one:
    x.append(features)
    Y.append(result)
x = np.array(x, dtype="float32")
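
Keras LSTM layers expect input of shape (samples, timesteps, features). The sketch below checks that layout on made-up data with 3 weeks and 5 features per week, matching the five input columns selected in `Sequence_data`; the names `pairs`, `features` and `labels` are introduced here for illustration.

```python
import numpy as np

# Made-up data: 4 samples, each 3 weeks long with 5 features per week.
pairs = [(np.zeros((3, 5)), 1) for _ in range(4)]

features = np.array([f for f, _ in pairs], dtype="float32")
labels = np.array([y for _, y in pairs])

print(features.shape)  # (4, 3, 5)
print(labels.shape)    # (4,)
```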

In [9]:
import tensorflow as tf
import sklearn
from sklearn import preprocessing
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint
import time

In [10]:
def Network(IN, OUT, TIME_PERIOD, EPOCHS, BATCH_SIZE, LTSM_SHAPE):

    # Hold out the last 10% of the (already shuffled) samples for validation.
    split = int(0.9 * len(OUT))
    train_x, validation_x = IN[:split], IN[split:]
    train_y = np.asarray(OUT[:split])
    validation_y = np.asarray(OUT[split:])

    # Define network & callback:
    NAME = f"pb_{TIME_PERIOD}_{EPOCHS}_{BATCH_SIZE}_{LTSM_SHAPE}_{time.time()}"
    tensorboard = TensorBoard(log_dir=f"logs/{NAME}")

    model = Sequential()
    model.add(LSTM(LTSM_SHAPE, input_shape=(train_x.shape[1:]), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())  # normalizes activation outputs, for the same reason the input data is normalized

    model.add(LSTM(LTSM_SHAPE, return_sequences=True))
    model.add(Dropout(0.1))
    model.add(BatchNormalization())

    model.add(LSTM(LTSM_SHAPE))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())

    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.2))

    # Two output units, one per class (worse-or-equal / better).
    model.add(Dense(2, activation='softmax'))

    opt = tf.keras.optimizers.Adam(learning_rate=0.001, decay=1e-6)
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

    history = model.fit(train_x, train_y, batch_size=BATCH_SIZE, epochs=EPOCHS,
                        validation_data=(validation_x, validation_y), callbacks=[tensorboard])
    print('\nhistory dict:', history.history)

    # Score the model on the held-out validation set
    score = model.evaluate(validation_x, validation_y, verbose=0)
    print('Validation loss:', score[0])
    print('Validation accuracy:', score[1])
    # Save model
    model.save(f"models/{NAME}")

EPOCHS, BATCH_SIZE, LTSM_SHAPE = 15, 32, 128

Network(x, Y, Length, EPOCHS, BATCH_SIZE, LTSM_SHAPE)

Train on 12779 samples, validate on 1420 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15

history dict: {'loss': [0.3772664407379089, 0.2554241951101122, 0.14860147107236588, 0.12636716363225542, 0.11559720440646071, 0.10976048074974608, 0.10424331439216915, 0.09840627836480519, 0.09391533736871534, 0.09184688622146082, 0.0894049367658903, 0.08660989719856904, 0.08614293281905751, 0.08504147657204686, 0.08594321590976607], 'accuracy': [0.8447453, 0.89678377, 0.93645823, 0.94764847, 0.9504656, 0.95711714, 0.9597778, 0.9605603, 0.96196884, 0.9625166, 0.964786, 0.96564674, 0.9661945, 0.96736836, 0.96674234], 'val_loss': [0.36459749286443416, 0.18736157494951303, 0.1103813712445783, 0.10469699086437763, 0.08460978513032617, 0.09346335362380659, 0.09355278736583783, 0.10932256151253068, 0.07782354712171453, 0.07429858197418737, 0.08572134843594591, 0.08344595249
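
The sample counts in the training log follow directly from the 90/10 slicing inside `Network`. As a small check, the sketch below reproduces them; the helper `split_sizes` is hypothetical, and the total of 14199 is inferred here as the sum of the two counts printed above.

```python
def split_sizes(n_samples, train_fraction=0.9):
    """Hypothetical helper mirroring the slicing used in Network()."""
    n_train = int(train_fraction * n_samples)  # int() truncates towards zero
    return n_train, n_samples - n_train

print(split_sizes(14199))  # (12779, 1420)
```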

A classifier with such high accuracy (training accuracy reaches roughly 0.97 after 15 epochs) suggests that the data can potentially be used to construct a more useful model, so we will attempt to build a regression model to predict the actual points that a player will get in future.

Begin similarly to before by making a function to produce appropriate groups of data, ordered temporally and containing the parameters which will be used to make the prediction:

In [11]:
def Sequence_data_i(passed_directory, length):
    SEQ_LEN = length
    sequential_data = []  # list of [sequence, target] pairs
    for filename in os.listdir(passed_directory):
        if filename.endswith(".csv"):
            df = pd.read_csv(f"{passed_directory}/{filename}", index_col="round")
            # The last column ('future_points') is the regression target; the rest are input features.
            input_df = df[["value", "transfers_in", "transfers_out", "bps", "ict_index", "minutes", "goals_scored", "goals_conceded", "future_points"]]
            prev_days = deque(maxlen=SEQ_LEN)  # rolling window of the last SEQ_LEN weeks
            for i in input_df.values:
                prev_days.append([n for n in i[:-1]])
                if len(prev_days) == SEQ_LEN:
                    sequential_data.append([np.array(prev_days), i[-1]])
    random.shuffle(sequential_data)  # mix players and weeks before splitting
    return sequential_data

Length = 9
x1 = []
Y2 = []
two= Sequence_data_i(working_dir, Length)
for features, result in two:
    x1.append(features)
    Y2.append(result)

x1 = np.asarray(x1)
Y2 = np.asarray(Y2)

The x1 array contains the input parameters and Y2 the corresponding output values. Now the network can be trained using this data:

In [12]:
import tensorflow as tf
from tensorflow.keras.optimizers import SGD
def Network_ii(IN, OUT, TIME_PERIOD, EPOCHS, BATCH_SIZE, LTSM_SHAPE):
 
    length = len(OUT)
    train_x = IN[:int(0.9 * length)]
    validation_x = IN[int(0.9 * length):]
    train_y = OUT[:int(0.9 * length)]
    validation_y = OUT[int(0.9 * length):]

    # Define Network & callback:
    train_x = train_x.reshape(train_x.shape[0],9, 8)
    validation_x = validation_x.reshape(validation_x.shape[0],9, 8)
    

    model = Sequential()
    model.add(LSTM(units=30, return_sequences= True, input_shape=(train_x.shape[1],8)))
    model.add(LSTM(units=30))
    model.add(Dense(units=1))
    model.compile(optimizer='adam', loss='mean_squared_error')

    train_y = np.asarray(train_y)
    validation_y = np.asarray(validation_y)
    history = model.fit(train_x, train_y, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(validation_x, validation_y))
    print('\nhistory dict:', history.history)

    # Score model
    score = model.evaluate(validation_x, validation_y, verbose=0)
    print('Test loss:', score)
    # Save model
    model.save(f"models/new_model")

EPOCHS, BATCH_SIZE, LTSM_SHAPE = 30, 32, 128

Network_ii(x1, Y2, Length, EPOCHS, BATCH_SIZE, LTSM_SHAPE)

Train on 10646 samples, validate on 1183 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30

history dict: {'loss': [0.006928224190397749, 0.0044821763370748065, 0.0003202683994726448, 5.6149403688647174e-05, 2.9939153353574213e-05, 2.552905204382892e-05, 1.8787913586775155e-05, 1.5064386000361413e-05, 1.4522970285933987e-05, 1.1667269069317244e-05, 1.183880488292517e-05, 1.2666767255974068e-05, 8.424324036996025e-06, 7.176866161026183e-06, 1.0073570502067217e-05, 7.276396245042157e-06, 8.257799901677116e-06, 9.507845274183402e-06, 8.478407929926759e-06, 5.509480130482442e-06, 6.7984518316084505e-06, 6.94208156892051e-06, 5.239020275925985e-06, 7.192835839819311e-06, 5.46401939450864

In [13]:
player_list = pd.read_csv('raw_data/2019-20/player_idlist.csv')
time_period = 9
pred_week = 11
paramaters = ["value", "transfers_in", "transfers_out", 'bps', 'ict_index', 'minutes', "goals_scored", "goals_conceded"]
actual_performance = ['future_points']
print("loading")
mod = tf.keras.models.load_model("models/new_model")
print("loaded")
facts = normlisation_factors
errs = []        # players for which no prediction could be made
totalerror = 0   # running sum of absolute errors
TOT = 0          # number of players with a prediction
TOTAV = 0        # running sum of actual bps
for i in range(len(player_list)):
    row = player_list.iloc[i]
    name = f"{row.iloc[0]}_{row.iloc[1]}"
    folder = f"{name}_{row.iloc[2]}"
    player = pd.read_csv(f'raw_data/2019-20/players/{folder}/gw.csv', index_col='round')

    try:
        # Only score players who registered bps in the target week.
        if player["bps"].iloc[pred_week] > 0:
            print(f"{name}:")
            # Scale with the factors found during training; columns missing from
            # the 2019-20 files (e.g. 'attempted_passes') are skipped.
            for var, value in facts.items():
                if var in player.columns:
                    player[var] = player[var].div(value)
            try:
                player_in = player[paramaters].iloc[pred_week - time_period:pred_week]
                IN = np.array(player_in.values)
                IN_reshaped = IN.reshape(1, time_period, len(paramaters))
                prediction = mod.predict(IN_reshaped)
                # Convert back from normalised units to bps.
                actual = facts["bps"] * player["bps"].iloc[pred_week]
                predicted = facts["bps"] * prediction[0][0]
                print("actual:", actual)
                print("predicted:", predicted)
                totalerror += abs(actual - predicted)
                TOT += 1
                TOTAV += actual
            except (ValueError, IndexError) as e:
                print(e)
                errs.append(folder)

    except (ValueError, IndexError):
        errs.append(folder)

loading
loaded
Héctor_Bellerín:
actual: 7.0
predicted: 0.1674309780355543
Sead_Kolasinac:
actual: 17.0
predicted: 1.9127255156636238
Rob_Holding:
actual: 9.0
predicted: 0.22184085845947266
Pierre-Emerick_Aubameyang:
actual: 10.0
predicted: 4.805386461317539
Alexandre_Lacazette:
actual: 3.0
predicted: 7.965954065322876
Bernd_Leno:
actual: 22.0
predicted: 12.945393919944763
Mesut_Özil:
actual: 12.0
predicted: 0.013773962447885424
Lucas_Torreira:
actual: 7.0
predicted: 0.22777131898328662
Matteo_Guendouzi:
actual: 13.0
predicted: 9.986929178237915
David_Luiz Moreira Marinho:
actual: 10.0
predicted: 26.08740082383156
Calum_Chambers:
actual: 14.0
predicted: 2.136921666562557
Nicolas_Pépé:
actual: 5.0
predicted: 14.219075426459312
Joseph_Willock:
actual: 10.0
predicted: 0.15311177889816463
Neil_Taylor:
actual: 3.0
predicted: 0.10137907415628433
Ørjan_Nyland:
actual: 13.0
predicted: 0.22538032662123442
Anwar_El Ghazi:
actual: 12.0
predicted: 2.9792553149163723
John_McGinn:
actual: 13.0
predic

actual: 21.0
predicted: 5.046436160802841
Rodrigo_Hernandez:
actual: 14.0
predicted: 0.0137713878066279
Aaron_Wan-Bissaka:
actual: 14.0
predicted: 10.875576853752136
Harry_Maguire:
actual: 9.0
predicted: 11.97307687997818
Victor_Lindelöf:
actual: 12.0
predicted: 15.000341981649399
Marcos_Rojo:
actual: 3.0
predicted: 0.12692507612518966
Marcus_Rashford:
actual: 20.0
predicted: 47.12889379262924
Mason_Greenwood:
actual: 3.0
predicted: 0.21465318999253213
David_de Gea:
actual: 10.0
predicted: 10.894525364041328
Anthony_Martial:
actual: 32.0
predicted: 5.6444689482450485
Jesse_Lingard:
actual: 6.0
predicted: 9.100685209035873
Daniel_James:
actual: 10.0
predicted: 21.912344723939896
Frederico_Rodrigues de Paula Santos:
actual: 13.0
predicted: 11.930544018745422
Andreas_Pereira:
actual: 28.0
predicted: 6.98422434926033
Scott_McTominay:
actual: 21.0
predicted: 32.036730229854584
DeAndre_Yedlin:
actual: 20.0
predicted: 11.955505192279816
Federico_Fernández:
actual: 14.0
predicted: 22.857514321

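Because the square root is taken of each squared difference individually, the quantity accumulated in the loop is the absolute error. Note that the division below is by the full player list rather than by the `TOT` players for which a prediction was made, so the figure understates the error per predicted player. The mean absolute error computed this way is: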

In [16]:
totalerror/len(player_list)

4.044702092397316
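
To illustrate both points, the sketch below recomputes the error on the first four players printed above. It is only a toy check: `n_listed` is a hypothetical list length chosen for illustration, and four players say nothing about the full result.

```python
import math

# (actual, predicted) pairs copied from the first four players printed above
pairs = [(7.0, 0.1674309780355543), (17.0, 1.9127255156636238),
         (9.0, 0.22184085845947266), (10.0, 4.805386461317539)]

# The square root of a single squared difference is just the absolute difference.
total_error = sum(math.sqrt((a - p) ** 2) for a, p in pairs)
total_abs = sum(abs(a - p) for a, p in pairs)
print(math.isclose(total_error, total_abs))  # True

n_predicted = len(pairs)  # analogous to TOT in the loop
n_listed = 10             # hypothetical: pretend the player list had 10 entries

print(total_error / n_predicted)  # mean absolute error per predicted player
print(total_error / n_listed)     # smaller, diluted by players without a prediction
```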

The average number of bonus points (bps) in the target week, over the players for which a prediction was made, is:

In [17]:
TOTAV/TOT

13.359848484848484

Thus, on this measure, the average error of the regression model is approximately 30 percent of the average score:

In [19]:
4.04/13.4

0.30149253731343284

As the predictions above show, the model gives reasonably good predictions. Given the range of points that a player can achieve, and although there is a significant error for some players, this is still a useful metric, as the model tends to predict high scores for players who did score highly and vice versa. Due to the random nature of sport, however, it is not surprising that there are a number of outliers. Further investigation into these outliers (such as measuring the statistics of the error, e.g. fitting it to a Gaussian distribution) would be useful in determining the exact nature of this model: it perhaps gives a good baseline prediction of players' performance, but does not allow for the random intricacies of the sport.
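
As a starting point for that investigation, one possible sketch is to estimate the mean and standard deviation of the prediction errors, which are the maximum-likelihood parameters of a Gaussian fit. The pairs below are copied from six of the players printed earlier; such a small subset is far too little to draw conclusions from, and the variable names are introduced here for illustration.

```python
import numpy as np

# (actual, predicted) values copied from six of the players printed above
actual = np.array([7.0, 17.0, 9.0, 10.0, 3.0, 22.0])
predicted = np.array([0.1674309780355543, 1.9127255156636238, 0.22184085845947266,
                      4.805386461317539, 7.965954065322876, 12.945393919944763])

residuals = predicted - actual
mu = residuals.mean()           # maximum-likelihood estimate of the Gaussian mean
sigma = residuals.std(ddof=0)   # maximum-likelihood estimate of the standard deviation

print(f"mean error: {mu:.2f}, standard deviation: {sigma:.2f}")
```

A mean far from zero would point to a systematic bias rather than random scatter; on this small subset the mean is negative, i.e. the model under-predicts.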