# Predicting FPL Scores

This is an example of how to use the FPL dataset to predict players' scores.

## Initial Steps

Let's first import everything we will need, and read the data in from the CSV files.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.utils.estimator_checks import check_estimator
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from typing import Optional, Union

data = pd.read_csv('../input/fpl-dataset/data.csv', index_col='entry_id')
element_types = pd.read_csv('../input/fpl-dataset/element_types.csv', index_col='id')
players = pd.read_csv('../input/fpl-dataset/players.csv', index_col='id')
teams = pd.read_csv('../input/fpl-dataset/teams.csv', index_col='id')

We now look at the columns of each dataframe:

In [None]:
print(data.info())

data.head()

In [None]:
print(element_types.info())

element_types.head()

In [None]:
print(players.info())

players.head()

In [None]:
print(teams.info())

teams.head()

## Preparing and Analysing the Data

Our aim for this notebook is to use the data collected in gameweeks 33-37 to predict the scores in gameweek 38. Therefore, we should split the 'data' table into training and testing subsets based on these gameweeks.

In [None]:
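# .copy() makes each subset an independent frame, so later in-place edits never touch `data`
train_data = data[data.event_id < 38].copy()
test_data = data[data.event_id == 38].copy()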

print("Training data entries:", train_data.shape[0])
print("Test data entries:", test_data.shape[0])

This seems like a good split. Next, notice that the columns `timestamp`, `event_id`, `opposition`, `fixture_code` and `kickoff_time` are purely for administrative purposes. They will not help us with our predictions, so we can safely delete them.

In [None]:
removed_cols = ['timestamp', 'fixture_code', 'kickoff_time', 'opposition', 'event_id']

for col in removed_cols:
    del train_data[col]
    del test_data[col]

Our target column to predict is going to be the 'response' of the player, i.e. how many points they went on to score in that gameweek. Let's explore how some of the columns influence this value in the training set. A first obvious column is the player's status: if a player is injured or unavailable, they are almost certain to score 0, while if they are fully fit and available, they are much more likely to score points. This is illustrated below. 

In [None]:
plt.figure(figsize=(10, 10))
sns.histplot(x=train_data.response, hue=train_data.status, binwidth=1)
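
The same relationship can also be read off numerically rather than from a plot. The sketch below uses a small made-up frame, not the real dataset: `'a'` is the available code used in this notebook, while `'i'` is a hypothetical code standing in for an unavailable player.

```python
import pandas as pd

# Made-up rows for illustration only; 'i' is a hypothetical "unavailable" code.
toy_status = pd.DataFrame({
    "status": ["a", "a", "a", "a", "i", "i"],
    "response": [0, 2, 6, 1, 0, 0],
})

# Share of zero scores within each status group.
zero_share = (toy_status["response"] == 0).groupby(toy_status["status"]).mean()
print(zero_share)
```

Applied to `train_data`, the same `groupby` would give the fraction of zero scores per status, which is what the histogram shows visually.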


This shows that a player whose status is not 'available' is very likely to score 0 points, so it is only interesting to explore the data for players whose status is available. The underlying driver is the number of minutes played: a player who is not available plays 0 minutes and therefore scores 0. How else can we predict minutes? Well, we have the minutes played in each of the last three matches.

In [None]:
# Restrict to available players once, then flag who played in each of the last three matches
available = train_data[train_data.status == 'a']
played_match_1 = available["minutes_1"] > 0
played_match_2 = available["minutes_2"] > 0
played_match_3 = available["minutes_3"] > 0

fig, ax = plt.subplots(1, 3, figsize=(20, 5), sharey=True)
# fill=True shades the area under each density curve
sns.kdeplot(x=available.response, hue=played_match_1, fill=True, ax=ax[0])
sns.kdeplot(x=available.response, hue=played_match_2, fill=True, ax=ax[1])
sns.kdeplot(x=available.response, hue=played_match_3, fill=True, ax=ax[2])

We see that a player is much more likely to score points if they are available and played more than 0 minutes in each of the last three matches. Pretty self-explanatory, right? So we've now explored what makes a player more likely to play, but what affects their score once they're on the pitch? The easiest features to look at are the points they scored in their last three matches - their **form**.

In [None]:
available_players = train_data[train_data.status == 'a'].copy()

sns.jointplot(x=available_players.form, y=available_players.response, kind='kde', fill=True)
sns.jointplot(x=available_players.points_1, y=available_players.response, kind='kde', fill=True)
sns.jointplot(x=available_players.points_2, y=available_players.response, kind='kde', fill=True)
sns.jointplot(x=available_players.points_3, y=available_players.response, kind='kde', fill=True)

In [None]:
sns.histplot(x=available_players.response, hue=available_players.is_home, binwidth=1)

## Preprocessing and Data Cleaning

Ahead of the modelling phase, we should clean up the data so that:

- Dummy fields are created for categorical variables
- Any missing data is filled in
- Erroneous data entries are removed
- Features are scaled so as to be comparable

In [None]:
train_data.info()

In terms of categorical variables, there is just one: `status`. We have already explored its values, so we will use the pandas `get_dummies` function to create dummy columns.

In [None]:
train_data = pd.get_dummies(train_data)
test_data  = pd.get_dummies(test_data)

train_data.info()
test_data.info()
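
One thing worth checking when the two frames are encoded separately is that they end up with the same dummy columns; a status that never occurs in one frame produces no column there. A minimal sketch of one way to align them, using made-up frames in which `'i'` is a hypothetical status code:

```python
import pandas as pd

# Made-up frames: the test frame happens to contain no rows with status 'i'.
toy_train = pd.DataFrame({"form": [1.0, 2.5, 0.0], "status": ["a", "i", "a"]})
toy_test = pd.DataFrame({"form": [3.0, 0.5], "status": ["a", "a"]})

train_enc = pd.get_dummies(toy_train)
# Reindex to the training columns; categories unseen in the test frame become all-zero columns.
test_enc = pd.get_dummies(toy_test).reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

If the `info()` output above already shows identical columns for both frames, no alignment is needed.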

`get_dummies` successfully splits the status in the training and testing data into 6 different boolean columns. Now let's fill in the missing values. For `chance_of_playing_this_round`, it makes sense to fill the column with 100%, since it is based on a player's availability, and more often than not a player is available. For the remaining data points, it is safe to assume that a missing value means the player has not played enough games to have one. Therefore, we will fill those values with zeros.

In [None]:
def fill_values(df):
    df['chance_of_playing_this_round'] = df['chance_of_playing_this_round'].fillna(100)
    df.fillna(0, inplace=True)
    
fill_values(train_data)
fill_values(test_data)

train_data.head()
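
As an aside, the same two-step fill can be written without mutating the frame, by passing a per-column default to `fillna` and chaining a second call. This is only a sketch on made-up rows; `form` stands in for any other column with gaps.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy_fill = pd.DataFrame({
    "chance_of_playing_this_round": [np.nan, 75.0, np.nan],
    "form": [1.2, np.nan, 0.0],
})

# Column-specific default first, then zeros for whatever is still missing.
filled = toy_fill.fillna({"chance_of_playing_this_round": 100}).fillna(0)
print(filled)
```

Returning a new frame instead of using `inplace=True` leaves the input untouched, which can make it easier to re-run cells in any order.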

In [None]:
train_data.info()

We can see that all columns are now numeric and completely filled with data. For the purposes of training a model, `player_id` is no longer required, so we drop it from the training set. We keep it in the test set for now, where it will let us match our predictions back to players.

In [None]:
del train_data['player_id']

Now we scale the features so that they are comparable - we do this using scikit-learn's StandardScaler class. First, we separate out the response column and the features, into numpy arrays X_train and y_train. Then we scale X_train based on the mean and standard deviation of each column.

In [None]:
features = list(train_data.columns)
features.remove('response')
X_train, y_train = train_data[features].values, train_data['response'].values
print(X_train.shape)
print(y_train.shape)

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
print(X_train)
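
To make the scaling step concrete, here is a small sketch on made-up numbers (the names `toy_scaler`, `X_tr` and `X_te` are illustrative only). The scaler learns each column's mean and standard deviation from the training rows, and those same statistics are then reused for any other rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices with very different column scales.
X_tr = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_te = np.array([[2.0, 200.0]])

toy_scaler = StandardScaler().fit(X_tr)  # per-column mean and std from training rows only
X_tr_s = toy_scaler.transform(X_tr)
X_te_s = toy_scaler.transform(X_te)      # reuses the training statistics

print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
print(X_te_s)
```

After scaling, each training column has mean 0 and standard deviation 1, so no feature dominates purely because of its units.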

## Training a Model

The data is now fully prepared for modelling - all features are numeric, with no gaps, and scaled to be comparable. We have also extracted the target variable, 'response' into y_train for training the model. So, let's try a basic Linear Regression first.

### Linear Regression

In [None]:
model1 = LinearRegression()
model1.fit(X_train, y_train)

Prepare the test data in the same way:

In [None]:
player_ids = test_data['player_id'].values 
del test_data['player_id']
X_test, y_test = test_data[features].values, test_data['response'].values
X_test = scaler.transform(X_test)

Now test the model's output:

In [None]:
def model_to_predictions(model):
    # One row per test player, indexed by player id
    predictions = pd.DataFrame(
        {'prediction': model.predict(X_test), 'actual': y_test},
        index=pd.Index(player_ids, name='id'),
    )
    # Attach player details by joining on the shared id index
    return predictions.join(players)

model_to_predictions(model1).sort_values('prediction', ascending=False)

In [None]:
model1.score(X_test, y_test)

We see that the basic linear regression model does fairly well, with an $R^2$ of 26% between its predictions and the actual scores on the test set. Let's visualise this below.

In [None]:
def plot_predictions(model):
    sns.regplot(x=model.predict(X_test), y=y_test)
    plt.xlabel('Prediction')
    plt.ylabel('Response')
    
plot_predictions(model1)
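
For reference, the value returned by `score` for a regressor is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on made-up numbers and compares the result with `r2_score`; the arrays are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual scores and predictions.
actual = np.array([0.0, 2.0, 6.0, 1.0, 3.0])
predicted = np.array([1.0, 2.0, 4.0, 1.5, 2.5])

ss_res = np.sum((actual - predicted) ** 2)      # squared error of the predictions
ss_tot = np.sum((actual - actual.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(actual, predicted))
```

An $R^2$ of 0 corresponds to doing no better than always predicting the mean score, and 1 to perfect predictions.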

### Random Forest Regression

Now we try a more sophisticated model, called random forest regression. This takes the average of many different decision trees in order to make a prediction.

In [None]:
model2 = RandomForestRegressor(n_estimators=500, max_depth=10, min_samples_split=2, min_samples_leaf=5)
model2.fit(X_train, y_train)
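
To illustrate the averaging, the sketch below fits a small forest on synthetic data (unrelated to the FPL dataset, with smaller hyperparameters than above to keep it quick) and compares the forest's prediction with the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0).fit(X_toy, y_toy)

# Each fitted tree is exposed in `estimators_`; average their predictions by hand.
per_tree = np.stack([tree.predict(X_toy) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

print(np.allclose(manual_mean, forest.predict(X_toy)))
```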

In [None]:
model_to_predictions(model2).sort_values('prediction', ascending=False)

In [None]:
model2.score(X_test, y_test)

In [None]:
plot_predictions(model2)

We see that using this more sophisticated model only achieves a similar $R^2$ value to linear regression.

### Custom Regression

Notice that so far, these models predicted clearly non-zero scores for a lot of players who went on to score zero. This is because the model is not quite 'detecting' whether a player is likely to play or not. In effect, it is hard to predict such a skewed distribution without overfitting. So, we aim to combat this using 'hurdle' regression: we train a classifier to predict the probability that a player scores a non-zero number of points, then multiply this probability by the output of a separate regression model. Note that the regression model must only be trained on data points with a non-zero response. This uses the fact that for any random variable $X$:

$E(X) = E(X | X \neq 0) P(X \neq 0)$
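
This identity follows from the law of total expectation, $E(X) = E(X | X \neq 0) P(X \neq 0) + E(X | X = 0) P(X = 0)$, where the second term vanishes because $E(X | X = 0) = 0$. A quick numerical check on a synthetic zero-inflated sample (the 40% playing rate and the 1 to 10 score range are arbitrary choices for illustration):

```python
import numpy as np

# Synthetic zero-inflated scores: roughly 60% zeros, otherwise an integer from 1 to 10.
rng = np.random.default_rng(0)
plays = rng.random(10_000) < 0.4
scores = np.where(plays, rng.integers(1, 11, size=10_000), 0)

p_nonzero = (scores != 0).mean()
mean_given_nonzero = scores[scores != 0].mean()

# The two sides agree up to floating-point rounding.
print(scores.mean(), mean_given_nonzero * p_nonzero)
```

In the hurdle model, the classifier estimates $P(X \neq 0)$ and the regressor estimates $E(X | X \neq 0)$.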

In [None]:
class CustomRegressor(BaseEstimator):
    """ Regression model which handles excessive zeros by fitting a two-part model and combining predictions:
            1) binary classifier (zero vs. non-zero response)
            2) continuous regression (fitted on non-zero responses only)
    Implemented as a valid sklearn estimator, so it can be used in pipelines and GridSearch objects.
    Args:
        clf_name: key in the lookup table below; currently supports 'logistic'
        reg_name: key in the lookup table below; currently supports 'linear' or 'random_forest'
        clf_params: dict of parameters to pass to classifier sub-model when initialized
        reg_params: dict of parameters to pass to regression sub-model when initialized
    """

    def __init__(self,
                 clf_name: str = 'logistic',
                 reg_name: str = 'random_forest',
                 clf_params: Optional[dict] = None,
                 reg_params: Optional[dict] = None):

        # sklearn convention: store constructor arguments unchanged
        self.clf_name = clf_name
        self.reg_name = reg_name
        self.clf_params = clf_params
        self.reg_params = reg_params

    @staticmethod
    def _resolve_estimator(func_name: str):
        """ Lookup table for supported estimators.
        This is necessary because sklearn estimator default arguments
        must pass equality test, and instantiated sub-estimators are not equal. """

        funcs = {'linear': LinearRegression(),
                 'random_forest': RandomForestRegressor(n_estimators=500, random_state=0),
                 'logistic': LogisticRegression(random_state=0)}

        return funcs[func_name]

    def fit(self, X: np.ndarray, y: np.ndarray):
        """ Fit the classifier on zero vs. non-zero, and the regressor on non-zero rows only """
        X, y = check_X_y(X, y, dtype=None,
                         accept_sparse=False,
                         accept_large_sparse=False,
                         force_all_finite='allow-nan')

        if X.shape[1] < 2:
            raise ValueError('Cannot fit model when n_features = 1')

        self.clf_ = self._resolve_estimator(self.clf_name)
        if self.clf_params:
            self.clf_.set_params(**self.clf_params)
        self.clf_.fit(X, y != 0)

        self.reg_ = self._resolve_estimator(self.reg_name)
        if self.reg_params:
            self.reg_.set_params(**self.reg_params)
        self.reg_.fit(X[y != 0], y[y != 0])

        self.is_fitted_ = True
        return self

    def predict(self, X: np.ndarray):
        """ Predict combined response: P(non-zero) times the regression prediction """
        X = check_array(X, accept_sparse=False, accept_large_sparse=False)
        check_is_fitted(self, 'is_fitted_')
        return self.clf_.predict_proba(X)[:, 1] * self.reg_.predict(X)

In [None]:
model3 = CustomRegressor()
model3.fit(X_train, y_train)

model_to_predictions(model3).sort_values('prediction', ascending=False)


In [None]:
r2_score(y_test, model3.predict(X_test))

In [None]:
plot_predictions(model3)

Again, opting for an even more complex model doesn't actually give us a great benefit. This suggests that we need to revisit the data - can features be engineered, or the data manipulated in a different way to achieve better results? Of course, more data will also be helpful to such a model - as more games are played in the seasons to come, we will be able to collect more of this data and refine our predictions.

## Conclusions

This data allows us to make a nice start in making predictions in FPL. Ultimately, how well a human being will perform on any given matchday is highly unpredictable, but an $R^2$ value of 26% at least gives us hope that, as FPL managers, we can use this as a guide to help us out.