# Predicting performance direction of Premier League Footballers using 'Fantasy Football' data and Machine learning

Using an LSTM neural network trained on data collected from three seasons of Premier League football, a predictive model of player performance was built. 'Direction of performance' simply means whether a player performed better or worse in the next match than in the previous week, i.e. a simple binary choice. Thus, a classification model is used here.

The construction of this model is outlined below.

### Importing and normalizing data
The data used to train the model has been collected from the official Premier League online game 'Fantasy Premier League'. This online game tracks multiple performance attributes, such as the number of shots or the number of fouls committed by each player every week, and based on these variables gives each player a score every week. The score that each player gets depends primarily on the number of bonus points ('bps') they receive, which is determined by set rules: for example, 4 bps if they score a goal, 2 bps for playing 90 minutes (the length of a Premier League football match), or -2 points if the player receives a red card. These rules are not of direct consequence for the analysis, but the full list can be found here: https://fantasy.premierleague.com/help/rules.

The number of bps awarded to a given player additionally depends on how well that player performs relative to other players in the same match for that week. Thus, bps and total points awarded per week vary enough from week to week, and depend on enough variables, that predicting them is a problem well suited to machine learning.

The first step of the analysis is to import the data, which is separated by season and, within each season, into individual .csv files for each player. Each file records various attribute values for each week (there are 38 weeks in a season). We begin by importing the required modules and defining the classification function, which will be used to define the performance direction:

In [63]:
import pandas as pd
import numpy as np
import os
import datetime

def classify(current, future):
    if float(future) > float(current):
        return 1
    else:
        return 0
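
As a quick check of the edge cases, the sketch below calls the same rule on a few made-up score pairs (these are illustrative values, not taken from the data set). A tie counts as "not better", and a missing value (NaN) also yields 0, because any comparison with NaN is false.

```python
def classify(current, future):
    # Same rule as above: 1 only for a strict improvement, otherwise 0.
    return 1 if float(future) > float(current) else 0

# Made-up (current, future) pairs: improvement, decline, tie, missing value.
examples = [(3, 5), (5, 3), (4, 4), (float("nan"), 4)]
labels = [classify(cur, fut) for cur, fut in examples]
print(labels)  # [1, 0, 0, 0]
```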

Next, the csv files need to be processed. Using the file 'raw_data/2016-17/players/Sergio_Agüero/gw.csv' as an example to show the format of these files, we have:

In [64]:
dat = pd.read_csv("raw_data/2016-17/players/Sergio_Agüero/gw.csv")
dat.columns

Index(['assists', 'attempted_passes', 'big_chances_created',
       'big_chances_missed', 'bonus', 'bps', 'clean_sheets',
       'clearances_blocks_interceptions', 'completed_passes', 'creativity',
       'dribbles', 'ea_index', 'element', 'errors_leading_to_goal',
       'errors_leading_to_goal_attempt', 'fixture', 'fouls', 'goals_conceded',
       'goals_scored', 'ict_index', 'id', 'influence', 'key_passes',
       'kickoff_time', 'kickoff_time_formatted', 'loaned_in', 'loaned_out',
       'minutes', 'offside', 'open_play_crosses', 'opponent_team', 'own_goals',
       'penalties_conceded', 'penalties_missed', 'penalties_saved',
       'recoveries', 'red_cards', 'round', 'saves', 'selected', 'tackled',
       'tackles', 'target_missed', 'team_a_score', 'team_h_score', 'threat',
       'total_points', 'transfers_balance', 'transfers_in', 'transfers_out',
       'value', 'was_home', 'winning_goals', 'yellow_cards'],
      dtype='object')

To see which variables correlate with 'bps' and 'total_points', the quantities the model is built to predict, use the corr() function:

In [65]:
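# shift(+1) moves values down one row, so each row's "future_points"
# holds the bps recorded in the previous row (gameweek).
dat["future_points"] = dat['bps'].shift(+1)
dat.corr()[["bps","future_points"]]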

Unnamed: 0,bps,future_points
assists,0.500451,-0.194994
attempted_passes,0.586816,0.170655
big_chances_created,0.333012,-0.004426
big_chances_missed,0.048586,0.197941
bonus,0.90967,-0.015912
bps,1.0,-0.013018
clean_sheets,0.28278,0.242436
clearances_blocks_interceptions,0.18806,0.089263
completed_passes,0.614451,0.130888
creativity,0.728404,0.156208
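
To make the alignment produced by `shift` explicit, here is a minimal sketch on a four-row toy frame. The column names `shift_plus_1` and `shift_minus_1` are illustrative only and do not appear in the data. With `shift(+1)` each row receives the value from the row before it, whereas `shift(-1)` would bring in the value from the row after it.

```python
import pandas as pd

# Toy frame with four consecutive gameweeks.
toy = pd.DataFrame({"round": [1, 2, 3, 4], "bps": [33, 57, 6, 0]})

toy["shift_plus_1"] = toy["bps"].shift(+1)   # value from the previous row
toy["shift_minus_1"] = toy["bps"].shift(-1)  # value from the next row
print(toy)
```

The first row of `shift_plus_1` and the last row of `shift_minus_1` are NaN, since there is no neighbouring row to take a value from.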


Since the data set being used is large, and given that players in different positions will likely have different dependencies on different variables (for example, Sergio Agüero is a forward, so his bps is potentially more dependent on the variable 'creativity' and less dependent on 'clean_sheets' than a defender's would be), we will initially use only the most general variables that have the highest correlation with bps:

In [92]:
variables = ["assists", "attempted_passes", "big_chances_missed", "clean_sheets", "completed_passes",
             "creativity", "dribbles", "goals_conceded", "goals_scored", "ict_index", "key_passes",
            "transfers_in", "transfers_out", "value", "bps", "total_points", "future_points", "round", "minutes"]
dat = dat[variables]
dat.head()

Unnamed: 0,assists,attempted_passes,big_chances_missed,clean_sheets,completed_passes,creativity,dribbles,goals_conceded,goals_scored,ict_index,key_passes,transfers_in,transfers_out,value,bps,total_points,future_points,round,minutes
0,0,33,0,0,26,29.6,1,1,1,14.3,2,0,0,130,33,9,,1,90
1,0,21,0,0,17,13.3,2,1,2,16.7,1,75648,12138,130,57,13,33.0,2,82
2,0,27,0,0,21,17.5,4,1,0,5.2,1,213510,8885,131,6,2,57.0,3,87
3,0,0,0,0,0,0.0,0,0,0,0.0,0,32042,984971,130,0,0,6.0,4,0
4,0,0,0,0,0,0.0,0,0,0,0.0,0,11435,167808,130,0,0,0.0,5,0


A general function is now required to loop through all the files and select the required columns as shown above. Whilst most files are 38 data entries long, there are some anomalies in length which need to be accounted for, which is done below. Other anomalies, such as missing data, are accounted for later.

In [93]:
def Select_Columns():
    """Copy the selected columns of every player's gw.csv into a new directory."""
    new_directory = f"filtered_data_{datetime.date.today()}"
    os.makedirs(new_directory, exist_ok=True)
    rounds_array = list(range(1, 39))  # gameweeks 1..38

    years = [[16, 17], [17, 18], [18, 19]]
    for year in years:
        season_dir = f'raw_data/20{year[0]}-{year[1]}'
        player_list = pd.read_csv(f'{season_dir}/player_idlist.csv')
        for i in range(len(player_list)):
            row = player_list.iloc[i]
            name = f"{row.iloc[0]}_{row.iloc[1]}"
            # Player folders are named either "First_Last" or "First_Last_id".
            try:
                player = pd.read_csv(f'{season_dir}/players/{name}/gw.csv')
            except FileNotFoundError:
                player = pd.read_csv(f'{season_dir}/players/{name}_{row.iloc[2]}/gw.csv')

            # Fix incorrect indexing in data: relabel the rounds so that they end at 38.
            difference_length = 38 - len(player["round"])
            player["round"] = rounds_array[difference_length:]
            # shift(+1): each row holds the bps of the previous row.
            player["future_points"] = player['bps'].shift(+1)
            player = player[variables]
            player.to_csv(f"{new_directory}/{name}_{year[0]}.csv", header=True, index=False)

    return new_directory

working_dir = Select_Columns()

Some of the files may not be of length 38 because the player they correspond to didn't play for a Premier League club for the whole season. In this case, the rounds are re-labelled to keep the data format consistent, which is necessary because the rounds are later used as an index. N.b. the re-labelled round indices may not match the actual gameweeks (some players may play for the first half of the season and others the second, etc., and this detail is lost in the above function's relabelling). This is not relevant here, as the network will be trained to analyse short-term patterns, and thus the time of season is not a factor. However, this is an improvement that could be investigated in future.
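
The relabelling rule can be sketched in isolation. The helper name `relabel_rounds` is hypothetical and introduced only for this illustration; it applies the same slice, `rounds_array[difference_length:]`, that the function above uses.

```python
rounds_array = list(range(1, 39))  # gameweeks 1..38

def relabel_rounds(n_rows):
    """Return round labels for a file with n_rows entries, always ending at 38."""
    difference_length = 38 - n_rows
    return rounds_array[difference_length:]

print(relabel_rounds(38)[:3], relabel_rounds(38)[-1])  # [1, 2, 3] 38
print(relabel_rounds(5))                               # [34, 35, 36, 37, 38]
```

A player with only five recorded weeks is therefore labelled as rounds 34 to 38, regardless of when in the season those weeks were actually played.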

Next, normalize and then structure the data in preparation for passing it to the LSTM:

In [94]:
def Normalise(passed_directory):
    # Add a new column, 'performance_direction', labelling each week 1 or 0 via classify().
    # Note: 'future_points' holds the previous row's bps (see shift(+1) above), so the
    # label is 1 when this row's total_points exceeds that value.
    for filename in os.listdir(passed_directory):
        if filename.endswith(".csv"):
            df = pd.read_csv(f"{passed_directory}/{filename}")
            df["performance_direction"] = list(map(classify, df["future_points"], df["total_points"]))
            df.to_csv(f"{passed_directory}/{filename}", header=True, index=False)

    # Create a dict to hold the maximum value of each variable, which is then used to
    # normalise the data. The column 'round' is dropped as it is not a variable, and
    # 'performance_direction' is added. A local list is used so that the global
    # 'variables' is left unchanged and the cell can be re-run.
    norm_variables = [v for v in variables if v != "round"] + ["performance_direction"]
    variables_max_values = dict.fromkeys(norm_variables, 0.0)

    # Loop through all files to determine the max values.
    for filename in os.listdir(passed_directory):
        if filename.endswith(".csv"):
            df = pd.read_csv(f"{passed_directory}/{filename}")
            if len(df.index) > 1:
                for var, value in variables_max_values.items():
                    if df[var].max() > value:
                        variables_max_values[var] = df[var].max()

    # Loop through a second time to normalise.
    new_directory = f"Normalised_{passed_directory}"
    os.makedirs(new_directory, exist_ok=True)
    for filename in os.listdir(passed_directory):
        if filename.endswith(".csv"):
            df = pd.read_csv(f"{passed_directory}/{filename}")
            for var, value in variables_max_values.items():
                df[var] = df[var].div(value)
            # Save the file for later if there are more than 10 non-zero entries.
            if df["minutes"][df["minutes"] > 0].count() > 10:
                df.to_csv(f"{new_directory}/{filename}", header=True, index=False)
    return variables_max_values, new_directory

normalisation_factors, working_dir = Normalise(working_dir)
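
The two-pass scaling used above (find the global maximum of each column across all files, then divide every file by it) can be illustrated on in-memory toy data. This is only a sketch: the dictionary `files` and its values are made up and stand in for the per-player CSV files.

```python
import pandas as pd

# Two made-up "player files" held in memory instead of on disk.
files = {
    "player_a": pd.DataFrame({"bps": [10, 40], "minutes": [45, 90]}),
    "player_b": pd.DataFrame({"bps": [20, 5], "minutes": [90, 0]}),
}

# First pass: global maximum of each column across all files.
max_values = {col: max(df[col].max() for df in files.values())
              for col in ["bps", "minutes"]}

# Second pass: divide every column by its global maximum.
normalised = {}
for name, df in files.items():
    scaled = df.copy()
    for col, value in max_values.items():
        scaled[col] = scaled[col].div(value)
    normalised[name] = scaled

print(normalised["player_a"])
```

Because the maxima are shared across files, values stay comparable between players: a bps of 20 maps to 0.5 for every player.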

In [95]:
import sklearn
from sklearn import preprocessing

In [99]:
from collections import deque
import random

def Sequence_data(passed_directory, length):
    j=0
    SEQ_LEN = length
    sequential_data = []  # this is a list that will CONTAIN the sequences
    for filename in os.listdir(f"{passed_directory}"):
        if filename.endswith(".csv"):
            df = pd.read_csv(f"{passed_directory}/{filename}", index_col="round")
            # Features: 'transfers_in', 'transfers_out', 'bps', 'ict_index', 'minutes'; label: 'performance_direction' (last column).
            input_df = df[["transfers_in", "transfers_out", 'bps', 'ict_index', 'minutes', 'performance_direction']]
            prev_days = deque(maxlen=SEQ_LEN)  
            for i in input_df.values: 
                prev_days.append([n for n in i[:-1]])  
                if len(prev_days) == SEQ_LEN:  
                    sequential_data.append([np.array(prev_days), i[-1]])  
    random.shuffle(sequential_data)
    #print(np.shape(sequential_data))
    #normal_data = min_max_scaler.fit_transform(np.array(sequential_data))
    return sequential_data

Length = 3
one= Sequence_data(working_dir, Length)
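
The sliding-window construction in `Sequence_data` can be seen on a small array. This is a minimal sketch with made-up numbers: each row is one week, the last column is the label and the remaining columns are features.

```python
from collections import deque
import numpy as np

SEQ_LEN = 3
rows = np.array([
    [0.1, 0.2, 0],
    [0.3, 0.4, 1],
    [0.5, 0.6, 0],
    [0.7, 0.8, 1],
])

window = deque(maxlen=SEQ_LEN)   # oldest week drops out automatically
sequences = []
for row in rows:
    window.append(row[:-1])      # features only
    if len(window) == SEQ_LEN:
        # The label is taken from the same row as the last week in the window.
        sequences.append((np.array(window), row[-1]))

print(len(sequences))           # 2
print(sequences[0][0].shape)    # (3, 2)
```

Four weeks with a window of three yield two sequences, each of shape (weeks in window, number of features), which is the layout the network input expects.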


In [117]:
x = []
Y = []
for features, result in one:
    x.append(features)
    Y.append(result)
x = np.array(x, dtype="float32")


In [118]:
x[1:].shape

(40305, 3, 5)

In [119]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint

In [120]:
import time

In [122]:
def Network(IN, OUT, TIME_PERIOD, EPOCHS, BATCH_SIZE, LSTM_SHAPE):
    # Use the first 90% of the (already shuffled) sequences for training and the rest for validation.
    split = int(0.9 * len(OUT))
    train_x, validation_x = IN[:split], IN[split:]
    train_y, validation_y = np.asarray(OUT[:split]), np.asarray(OUT[split:])

    # Define Network & callback:
    NAME = f"pb_{TIME_PERIOD}_{EPOCHS}_{BATCH_SIZE}_{LSTM_SHAPE}_{time.time()}"
    tensorboard = TensorBoard(log_dir=f"logs/{NAME}")

    model = Sequential()
    model.add(LSTM(LSTM_SHAPE, input_shape=(train_x.shape[1:]), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())  # normalizes activation outputs, same reason you want to normalize your input data.

    model.add(LSTM(LSTM_SHAPE, return_sequences=True))
    model.add(Dropout(0.1))
    model.add(BatchNormalization())

    model.add(LSTM(LSTM_SHAPE))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())

    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.2))

    # Two output units, one per class (0 = not better, 1 = better).
    model.add(Dense(2, activation='softmax'))

    opt = tf.keras.optimizers.Adam(learning_rate=0.001, decay=1e-6)
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

    history = model.fit(train_x, train_y, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(validation_x, validation_y), callbacks=[tensorboard])
    print('\nhistory dict:', history.history)

    # Score model on the held-out validation data
    score = model.evaluate(validation_x, validation_y, verbose=0)
    print('Validation loss:', score[0])
    print('Validation accuracy:', score[1])
    # Save model
    model.save(f"models/{NAME}")

EPOCHS, BATCH_SIZE, LSTM_SHAPE = 20, 32, 128

Network(x, Y, Length, EPOCHS, BATCH_SIZE, LSTM_SHAPE)

Train on 36275 samples, validate on 4031 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

history dict: {'loss': [0.2872269661006724, 0.11790631096163119, 0.09565196998935992, 0.08845769866202143, 0.08298545957330504, 0.07887787106358989, 0.072709755674364, 0.07453491659800321, 0.07080402642930071, 0.07029391212413263, 0.0670176158662626, 0.0676957179657052, 0.06502105751976649, 0.06353162377821134, 0.06347888470425185, 0.06309265810998861, 0.06198033247759205, 0.061591503663228596, 0.06114299285892451, 0.06017098011681329], 'accuracy': [0.88945556, 0.9572157, 0.96645075, 0.9681323, 0.9705582, 0.9722674, 0.9740868, 0.9733977, 0.9756582, 0.97499657, 0.9755203, 0.97626466, 0.9773122, 0.97863543, 0.9782219, 0.97816676, 0.9783873, 0.97783595, 0.9789111, 0.97877324], 'val_loss': [0.11282417357441096, 0.

FileExistsError: [Errno 17] File exists: 'models'
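
Since the final layer is a two-unit softmax, each prediction is a pair of class probabilities. The sketch below shows how such outputs would be turned into class labels and an accuracy figure. The probabilities and true labels are made up for illustration and are not outputs of the model above.

```python
import numpy as np

# Hypothetical softmax outputs for three sequences: columns are P(class 0), P(class 1).
probs = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.55, 0.45]])
true = np.array([0, 1, 1])            # made-up ground-truth labels

predicted = probs.argmax(axis=1)      # pick the more probable class per row
accuracy = (predicted == true).mean() # fraction of correct predictions
print(predicted, accuracy)
```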

In [None]:
"""
https://fantasy.premierleague.com/api/my-team/2634476/ -> specific fantasy team's players

"""
import json
import pickle
import requests


def Update_files():
    url = "https://fantasy.premierleague.com/api/bootstrap-static/"
    r = requests.get(url)
    data = json.loads(r.text)           # Convert the response to a Python dict and retrieve the data from it.

    # 'gameweeks' contains the average fantasy team score, the dream team score, and the most selected,
    # most transferred-in, most captained, most vice-captained and highest scoring players, with the
    # players referenced by their 'id' tag.
    gameweeks = data['events']
    teams = data['teams']
    players = data['elements']
    for gameweek in gameweeks:
        del gameweek['chip_plays']       # Don't need this data and it is problematic to format

    current_gameweek = None              # Stays None if no gameweek has finished yet.
    for event in gameweeks:              # Get current week: the last finished gameweek.
        if not event['finished']:
            break
        current_gameweek = event['id']

    # Fields that are not needed for the analysis.
    unused_fields = ["chance_of_playing_next_round", "chance_of_playing_this_round",
                     "cost_change_event", "cost_change_event_fall",
                     "cost_change_start", "cost_change_start_fall",
                     "value_form", "value_season", "web_name",
                     "transfers_out_event", "transfers_in_event", "status",
                     "squad_number", "special", "photo", "news_added", "news",
                     "ep_this", "ep_next"]
    for player in players:
        for field in unused_fields:
            del player[field]
        player["Gameweek"] = current_gameweek

    # Save this week's data as a csv and the original response (pickled) to a .txt file.
    player_df = pd.DataFrame(data=players)
    gameweeks_df = pd.DataFrame(data=gameweeks)
    gameweeks_df.to_csv("Gameweeks/Overall_FF_stats.csv", header=True, mode='w', index=False)
    player_df.to_csv(f'Gameweeks/GW{current_gameweek}_player_data.csv', header=True, mode='w', index=False)
    with open(f"Gameweeks/GW{current_gameweek}_player_data.txt", 'wb') as f:
        pickle.dump(r, f)
    # Reload with p = pickle.load(open(f"Gameweeks/GW{current_gameweek}_player_data.txt", 'rb')) & data = json.loads(p.text)
    # Append player data to player files:
    for index, player in player_df.iterrows():
        player_file = f'NEW_DATA_adjusted/{player["first_name"]}_{player["second_name"]}_{player["id"]}.csv'
        if os.path.isfile(player_file):
            total_data = pd.read_csv(player_file)
            if player["Gameweek"] > total_data["Gameweek"].iloc[-1]:
                # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
                result = pd.concat([total_data, pd.DataFrame([player])], ignore_index=True)
                result.to_csv(player_file, header=True, mode='w', index=False)
        else:
            save_data = pd.DataFrame([player])
            save_data.to_csv(player_file, header=True, mode='w', index=False)
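
The "current gameweek" logic above can be checked without calling the API. In this sketch the `events` list is made up and contains only the two fields the loop reads (`id` and `finished`); the helper name `latest_finished` is hypothetical.

```python
def latest_finished(events):
    """Return the id of the last finished event, or None if none has finished."""
    current = None
    for event in events:
        if not event["finished"]:
            break               # events are in order, so stop at the first unfinished one
        current = event["id"]
    return current

# Made-up events with only the fields used above.
events = [
    {"id": 1, "finished": True},
    {"id": 2, "finished": True},
    {"id": 3, "finished": False},
]
print(latest_finished(events))  # 2
```

Returning `None` when no gameweek has finished makes the pre-season case explicit, rather than leaving the variable undefined.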
