In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from helpers import *

In [2]:
# path to project directory
path = Path('./')

In [3]:
# read in training dataset
train_df = pd.read_csv(path/'data/train_v4.csv', index_col=0, dtype={'season':str})

## Baseline model

A baseline model is an effective way to see whether more sophisticated approaches actually have more predictive power.

The approach we will take is very straightforward: a player's predicted score is their historical points per minute (total points scored divided by total minutes played, across all games in history), multiplied by the actual number of minutes played in the gameweek being predicted. Note that, at the moment, the model uses actual minutes; a predicted number of minutes is not yet included.
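As a small worked example of this calculation, the sketch below uses a made-up history for a single player (the numbers are purely illustrative and not taken from the dataset).

```python
import pandas as pd

# Hypothetical history for one player: minutes and points in past fixtures
history = pd.DataFrame({
    'minutes': [90, 45, 0, 90],
    'total_points': [6, 2, 0, 10],
})

# points per minute over the whole history (18 points in 225 minutes)
points_per_minute = history['total_points'].sum() / history['minutes'].sum()

# baseline prediction for a gameweek in which the player actually plays 60 minutes
minutes_played = 60
baseline_pred = points_per_minute * minutes_played
print(round(baseline_pred, 2))
```

Expressing the history as points per 90 minutes and then multiplying by minutes / 90, as the function further down does, gives the same number.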

We'll start by using the lag_features functions to obtain the points per game (90 minutes) for all games in history at each gameweek for each player (including team and opponent team points per game).

In [4]:
# add all lag points scored to training set
train_df, team_lag_vars = team_lag_features(train_df, ['total_points'], ['all'])
train_df, player_lag_vars = player_lag_features(train_df, ['total_points'], ['all'])
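The lag functions live in helpers and are not shown here. As a rough sketch of the idea only (not the actual implementation), an all-history points-per-90 feature could be built per player from cumulative sums that exclude the current row. The function name points_per_90_lag and the toy data are assumptions made for illustration; the column names follow the ones used in this notebook.

```python
import numpy as np
import pandas as pd

def points_per_90_lag(df):
    """Hypothetical sketch: per-player points per 90 minutes over all earlier rows."""
    df = df.sort_values(['player', 'kickoff_time']).copy()
    grouped = df.groupby('player')
    # cumulative totals minus the current row, so only earlier fixtures count
    prior_points = grouped['total_points'].cumsum() - df['total_points']
    prior_minutes = grouped['minutes'].cumsum() - df['minutes']
    # players with no minutes in their history get NaN rather than a division by zero
    df['total_points_pg_last_all'] = prior_points / prior_minutes.replace(0, np.nan) * 90
    return df

# toy data (made up) for two players
toy = pd.DataFrame({
    'player': ['A', 'A', 'A', 'B', 'B'],
    'kickoff_time': pd.to_datetime(['2019-08-10', '2019-08-17', '2019-08-24',
                                    '2019-08-10', '2019-08-17']),
    'minutes': [90, 45, 90, 0, 90],
    'total_points': [6, 3, 2, 0, 5],
})
toy_lag = points_per_90_lag(toy)
print(toy_lag[['player', 'minutes', 'total_points', 'total_points_pg_last_all']])
```

Excluding the current row is what makes the value usable as a feature: at each gameweek it only reflects what was known before kickoff.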

Our baseline model function generates predictions using the above approach for a given validation point (i.e. the gameweek at which validation starts) and validation length (how many gameweeks we include, starting from the validation point). If a player has no history then we just assume that they will score 0 points.

We return the predictions and actuals to allow easy performance assessment.

In [5]:
def simple_model(df, valid_season, valid_gw, valid_len):
    valid_start, valid_end = validation_season_idx(df, valid_season, [valid_gw], valid_len)[0]
    
    train_idx = range(valid_start)
    valid_idx = range(valid_start, valid_end + 1)    
    
    train = df.iloc[train_idx]
    valid = df[['player', 'gw', 'season', 'minutes', 'total_points']].iloc[valid_idx]
    
    season_point = valid['season'].iloc[0]
    gw_point = valid['gw'].iloc[0]
    
    # get player total per game average at validation point
    player_points_pg = df[(df['season'] == season_point) & 
                          (df['gw'] == gw_point)][['player', 'total_points_pg_last_all']]
    
    pred_df = valid.merge(player_points_pg, on='player', how='left')   
    pred_df.fillna(0, inplace=True)
    preds = pred_df['total_points_pg_last_all'] * pred_df['minutes'] / 90
    targs = pred_df['total_points']
    
    return preds, targs

We can go ahead and use this model to make some predictions.

In [61]:
# set validation start at gameweek 20 in the 2019/20 season
# validate 6 gameweeks of predictions
valid_season = '1920'
valid_gw = 20
valid_len = 6

# make predictions
preds, targs = simple_model(train_df, valid_season, valid_gw, valid_len)

We then calculate the root mean squared error (RMSE) across all predictions.

In [7]:
# calculate rmse
r_mse(preds, targs)

1.904038
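The r_mse function comes from helpers and its definition is not shown in this notebook. A minimal sketch of what such a function typically computes is given below; the name r_mse_sketch and the rounding to six decimal places are assumptions (the rounding is chosen only because it matches the precision of the output above).

```python
import numpy as np

def r_mse_sketch(preds, targs):
    """Root mean squared error between predictions and targets, rounded to 6 decimals."""
    preds = np.asarray(preds, dtype=float)
    targs = np.asarray(targs, dtype=float)
    return round(float(np.sqrt(np.mean((preds - targs) ** 2))), 6)

# toy example: a single error of 2 points across three predictions
example_rmse = r_mse_sketch([1, 2, 3], [1, 2, 5])
print(example_rmse)
```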

It's worth running little tests to check for errors. In this case we can perform the same prediction using a slightly more direct way (not using any indexes).

In [8]:
# we can calculate the same predictions directly using the season and gameweek
# to check the above function 
# get player total per game average at validation point
player_points_pg = train_df[(train_df['season'] == valid_season) &
                            (train_df['gw'] == valid_gw)][['player', 'total_points_pg_last_all']]

valid_df = train_df[(train_df['season'] == valid_season)
                    & (train_df['gw'] >= valid_gw)
                    & (train_df['gw'] < valid_gw + valid_len)][['player', 'minutes', 'total_points']]

pred_df = valid_df.merge(player_points_pg, on='player', how='left')
pred_df['pred'] = pred_df['total_points_pg_last_all'] * pred_df['minutes'] / 90
pred_df.fillna(0, inplace=True)

In [9]:
# calculate rmse
r_mse(pred_df['pred'], pred_df['total_points'])

1.904038

The two error values should be equal, and they are (both 1.904038). The intention is to introduce more of these tests in a more robust software engineering way (i.e. unit tests) in the future.
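As an illustration of what such a unit test might look like, here is a minimal sketch. The helper baseline_preds is hypothetical: it simply isolates the prediction formula so that it can be tested on a few hand-checked values, and it is not part of the helpers module. The test function is written so that a runner such as pytest could pick it up; it is also called directly here.

```python
import numpy as np
import pandas as pd

def baseline_preds(points_pg, minutes):
    """Hypothetical helper: points per 90 scaled by minutes, with no history treated as 0."""
    return points_pg.fillna(0) * minutes / 90

def test_baseline_preds():
    points_pg = pd.Series([9.0, np.nan, 4.5])
    minutes = pd.Series([90, 60, 45])
    # full match at 9.0 per 90; no history gives 0; half a match at 4.5 per 90
    expected = pd.Series([9.0, 0.0, 2.25])
    pd.testing.assert_series_equal(baseline_preds(points_pg, minutes), expected)

# run the check directly
test_baseline_preds()
```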

In the above model we use the lag points per game values from the first validation gameweek, since these contain information for everything up to, but not including, that gameweek. However, the lag variables are recalculated at each gameweek, so the values attached to later validation gameweeks include results from inside the validation window. Using them for predictions made more than 1 gameweek after the validation point would therefore leak information. The above model gives us a useful way to create a new training dataset that avoids this problem, by setting the lag features of all subsequent validation gameweeks equal to those of the first validation gameweek.

This can then be used to train and validate some more sophisticated modeling.
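To make the idea concrete before the full function, here is a toy sketch with a single player and made-up lag values. It only illustrates the freezing step; blank gameweeks and team or opponent features are ignored.

```python
import pandas as pd

# toy data (made up): one player's lag feature as recalculated at each gameweek
toy = pd.DataFrame({
    'player': ['A'] * 4,
    'gw': [19, 20, 21, 22],
    'total_points_pg_last_all': [5.0, 5.2, 5.9, 5.5],
})
valid_gw = 20

# the lag value known at the validation point
frozen = toy.loc[toy['gw'] == valid_gw, ['player', 'total_points_pg_last_all']]

# training rows keep their own lag values
train = toy[toy['gw'] < valid_gw]

# validation rows drop their own lag values and receive the frozen one instead
valid = toy[toy['gw'] >= valid_gw].drop(columns='total_points_pg_last_all')
valid = valid.merge(frozen, on='player', how='left')
print(valid)
```

After the merge, gameweeks 21 and 22 carry the gameweek 20 value (5.2) rather than 5.9 and 5.5, which would not have been known at the validation point.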

In [54]:
# We can adapt this approach to also create validation sets with lag features
# When making predictions for gw +2 and beyond we cannot use those weeks' lag features
# This would be leakage if we did
# Instead, each subsequent validation week should have the same lag values as the first
def create_lag_train(df, cat_vars, cont_vars, player_lag_vars, team_lag_vars, dep_var, valid_season, valid_gw, valid_len):

    # get all the lag data for the current season up to the first validation gameweek
    player_lag_vals = df[(df['season'] == valid_season) & 
                         (df['gw'] <= valid_gw)][['player', 'kickoff_time'] + player_lag_vars]
    
    team_lag_vals = df[(df['season'] == valid_season) & 
                       (df['gw'] <= valid_gw)][['team', 'kickoff_time'] + 
                                               [x for x in team_lag_vars if "opponent" not in x]].drop_duplicates()
                                               
    opponent_team_lag_vals = df[(df['season'] == valid_season) & 
                                (df['gw'] <= valid_gw)][['opponent_team', 'kickoff_time'] + 
                                                        [x for x in team_lag_vars if "opponent" in x]].drop_duplicates()
    
    # get the last available lag data for each player
    # for most it will be the first validation week
    # but sometimes teams have blank gameweeks
    # in these cases it will be the previous gameweek
    player_lag_vals = player_lag_vals[player_lag_vals['kickoff_time'] == 
                                      player_lag_vals.groupby('player')['kickoff_time'].transform('max')]
    team_lag_vals = team_lag_vals[team_lag_vals['kickoff_time'] == 
                                  team_lag_vals.groupby('team')['kickoff_time'].transform('max')]
    opponent_team_lag_vals = opponent_team_lag_vals[opponent_team_lag_vals['kickoff_time'] == 
                                                    opponent_team_lag_vals.groupby('opponent_team')['kickoff_time'].transform('max')]
                                                                    
    player_lag_vals = player_lag_vals.drop('kickoff_time', axis=1)
    team_lag_vals = team_lag_vals.drop('kickoff_time', axis=1)
    opponent_team_lag_vals = opponent_team_lag_vals.drop('kickoff_time', axis=1)
    
    # get the validation start and end indexes
    valid_start, valid_end = validation_season_idx(df, valid_season, [valid_gw], valid_len)[0]
    train_idx = range(valid_start)
    valid_idx = range(valid_start, valid_end + 1)    

    # split out train and validation sets
    # do not include lag vars in validation set
    cat_vars = list(set(['opponent_team', 'team', 'player'] + cat_vars))
    
    train = df[cat_vars + cont_vars + 
               player_lag_vars + team_lag_vars + 
               dep_var].iloc[train_idx]
    valid = df[cat_vars + cont_vars + dep_var].iloc[valid_idx]

    # add in lag vars
    # will be the same for all validation gameweeks
    valid = valid.merge(player_lag_vals, on='player', how='left')
    valid = valid.merge(team_lag_vals, on='team', how='left')
    valid = valid.merge(opponent_team_lag_vals, on='opponent_team', how='left')
    
    # concatenate train and validation sets again
    lag_train_df = pd.concat([train, valid]).reset_index()

    return lag_train_df, train_idx, valid_idx

The arguments for this function also include the categorical, continuous and lag features that we want to include in our model. This fits nicely with the APIs for the algorithms that we will subsequently be applying.

In [55]:
# variables to include
# lag variables are already generated in the player_lag_features function, but can be changed
cat_vars = ['gw', 'season', 'team', 'opponent_team']
cont_vars = ['minutes']
dep_var = ['total_points']

lag_train_df,_,_ = create_lag_train(train_df, 
                                cat_vars, cont_vars, 
                                player_lag_vars, team_lag_vars, dep_var,
                                valid_season, valid_gw, valid_len)

In [56]:
lag_train_df

Unnamed: 0,index,gw,season,team,player,opponent_team,minutes,total_points_pg_last_all,total_points_team_pg_last_all,total_points_team_pg_last_all_opponent,total_points
0,0,1,1617,West Ham United,Aaron_Cresswell,Chelsea,0,,,,0
1,1,1,1617,Everton,Aaron_Lennon,Tottenham Hotspur,15,,,,1
2,2,1,1617,Arsenal,Aaron_Ramsey,Liverpool,60,,,,2
3,3,1,1617,Watford,Abdoulaye_Doucouré,Southampton,0,,,,0
4,4,1,1617,Chelsea,Abdul Rahman_Baba,West Ham United,0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...
88440,4007,35,1920,Manchester City,Tommy_Doyle,Brighton and Hove Albion,0,,54.811189,35.323810,0
88441,4008,35,1920,West Ham United,Joseph_Anang,Norwich,0,,38.279720,32.862069,0
88442,4009,35,1920,Burnley,Erik_Pieters,Liverpool,64,2.737643,37.335664,54.538462,2
88443,4010,35,1920,Tottenham Hotspur,Japhet_Tanganga,Arsenal,0,2.914286,48.664336,45.412587,0


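We can do a quick check to make sure that the lag variables stay the same across the validation gameweeks (20-25 with the settings above). Note that in the outputs shown below the player and team lag values are constant from gameweek 30 onwards, which suggests these cells were run with a validation point of gameweek 30.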

In [59]:
# check features for a player
lag_train_df[lag_train_df['player'] == 'Kevin_De Bruyne'].tail(10)

Unnamed: 0,index,gw,season,team,player,opponent_team,minutes,total_points_pg_last_all,total_points_team_pg_last_all,total_points_team_pg_last_all_opponent,total_points
82428,82428,26,1920,Manchester City,Kevin_De Bruyne,West Ham United,78,6.351989,54.805755,38.460432,14
83051,83051,27,1920,Manchester City,Kevin_De Bruyne,Leicester City,90,6.437166,54.928571,40.842857,3
84180,84180,29,1920,Manchester City,Kevin_De Bruyne,Manchester United,0,6.403044,54.893617,45.535211,0
84881,448,30,1920,Manchester City,Kevin_De Bruyne,Arsenal,69,6.492611,54.811189,45.412587,14
84882,449,30,1920,Manchester City,Kevin_De Bruyne,Burnley,29,6.492611,54.811189,37.335664,1
85576,1143,31,1920,Manchester City,Kevin_De Bruyne,Chelsea,90,6.492611,54.811189,49.335664,8
86221,1788,32,1920,Manchester City,Kevin_De Bruyne,Liverpool,90,6.492611,54.811189,54.538462,14
86872,2439,33,1920,Manchester City,Kevin_De Bruyne,Southampton,31,6.492611,54.811189,37.083916,1
87525,3092,34,1920,Manchester City,Kevin_De Bruyne,Newcastle United,90,6.492611,54.811189,38.314286,7
88178,3745,35,1920,Manchester City,Kevin_De Bruyne,Brighton and Hove Albion,63,6.492611,54.811189,35.32381,3


In [60]:
# check features for a player
lag_train_df[lag_train_df['player'] == 'Mohamed_Salah'].tail(10)

Unnamed: 0,index,gw,season,team,player,opponent_team,minutes,total_points_pg_last_all,total_points_team_pg_last_all,total_points_team_pg_last_all_opponent,total_points
82132,82132,26,1920,Liverpool,Mohamed_Salah,Norwich,90,8.136036,54.863309,32.92,3
82755,82755,27,1920,Liverpool,Mohamed_Salah,West Ham United,90,8.079193,54.892857,38.314286,7
83365,83365,28,1920,Liverpool,Mohamed_Salah,Watford,90,8.06738,54.843972,35.574468,2
83971,83971,29,1920,Liverpool,Mohamed_Salah,Bournemouth,90,8.001684,54.598592,37.584507,9
84526,93,30,1920,Liverpool,Mohamed_Salah,Everton,0,8.012378,54.538462,41.440559,0
85275,842,31,1920,Liverpool,Mohamed_Salah,Crystal Palace,90,8.012378,54.538462,38.776224,11
85919,1486,32,1920,Liverpool,Mohamed_Salah,Manchester City,90,8.012378,54.538462,54.811189,2
86567,2134,33,1920,Liverpool,Mohamed_Salah,Aston Villa,90,8.012378,54.538462,35.448276,6
87220,2787,34,1920,Liverpool,Mohamed_Salah,Brighton and Hove Albion,90,8.012378,54.538462,35.32381,18
87873,3440,35,1920,Liverpool,Mohamed_Salah,Burnley,90,8.012378,54.538462,37.335664,2
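Rather than inspecting individual players by eye, the same check could be automated. The sketch below is one possible way to do it; the helper lag_is_constant is hypothetical and the toy data is made up, though the column names match those in lag_train_df.

```python
import pandas as pd

def lag_is_constant(df, lag_cols, season, first_gw, group_col='player'):
    """Hypothetical check: each lag column has at most one distinct value per group
    across the rows from first_gw onwards in the given season."""
    valid_rows = df[(df['season'] == season) & (df['gw'] >= first_gw)]
    n_unique = valid_rows.groupby(group_col)[lag_cols].nunique(dropna=False)
    return bool((n_unique <= 1).all().all())

# toy data (made up): the lag value is frozen from gameweek 20 onwards
toy = pd.DataFrame({
    'player': ['A', 'A', 'A'],
    'season': ['1920', '1920', '1920'],
    'gw': [19, 20, 21],
    'total_points_pg_last_all': [5.0, 5.2, 5.2],
})

frozen_from_20 = lag_is_constant(toy, ['total_points_pg_last_all'], '1920', 20)
frozen_from_19 = lag_is_constant(toy, ['total_points_pg_last_all'], '1920', 19)
print(frozen_from_20, frozen_from_19)
```

Grouping by player suits the player lag columns; the opponent lag column changes with each fixture's opponent, so it would need to be checked with group_col='opponent_team' instead.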
