In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from IPython.display import display

In [2]:
# path to project directory
path = Path('./')

In [5]:
# read in training dataset
train_df = pd.read_csv(path/'data/train_v5.csv', index_col=0, dtype={'season':str})

## The FPL dataset

These are the fields in the base dataset, all from fpl and transfermarkt, which are updated after the conclusion of every gameweek.

In [6]:
# summary of fields
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91401 entries, 0 to 91400
Data columns (total 37 columns):
player                                        91401 non-null object
gw                                            91401 non-null int64
position                                      91401 non-null int64
minutes                                       91401 non-null int64
team                                          91401 non-null object
opponent_team                                 91401 non-null object
relative_market_value_team                    23465 non-null float64
relative_market_value_opponent_team           23465 non-null float64
was_home                                      91401 non-null bool
total_points                                  91401 non-null int64
assists                                       91401 non-null int64
bonus                                         91401 non-null int64
bps                                           91401 non-null int64
clean_sheets  

Each row represents one player's performance in a single fixture, and will be unique across the player name and kickoff time fields:

- player (player name)
- kickoff_time (kickoff time for the fixture)

The fixtures are futher defined with the following fields:

- team (the player's team)
- opponent_team (the opposition team)
- was_home (was it a home game for the player)
- season (e.g. '1920' for the 2019/20 season)
- gw (the FPL gameweek in which the fixture occured)

Note that there can be multiple fixtures (i.e. rows for a given player) in a single gameweek - so called double gameweeks.

The position that a player plays is also given, this will be consistent for each player within seasons, but may change between seasons:

- position (1 - goalkeeper, 2 - defender, 3 - midfielder, 4 - forward)

Most of the other fields describe the player (or team's) performance in the fixture e.g. the number of munites played, points scored, assists, goals, goals conceded while on the field, etc.

All the above should be 100% complete for all rows.

Incomplete fields for FPL data are:

- transfer and selected values (transfers_in, transfers_out, transfers_balance, selected) - these were only collected from the start of the 2019/20 season, and require further investigation as to what they actually represent (in other words, treat with caution when modelling); values prior to the 2019/20 are set to 0
- play_proba - again only collected from the start of the 2019/20 season, this is the probability that the the player would actually be available for the fixture according to the FPL website (note that the time that this is captured each week  varies); values prior to the 2019/20 are null, and they are also null for any new players in a given gameweek (i.e. players that FPL has added to the game during that gameweek)

Finally, team transfer market value is taken from transfermarkt each week (for the 2019/20) season or a single value has been taken for the whole season:

- relative_market_value_team - the market value for the team scraped during that gameweek (non null from start of 2019/20 season)
- relative_market_value_opponent_team - the market value for the opposition team scraped during that gameweek (non null from start of 2019/20 season)
- relative_market_value_team_season - a single value for the team's value from the the start of each season 
- relative_market_value_opponent_team_season - a single value for the opposition team's value from the the start of each season 

In [7]:
# take a look at some data
pd.options.display.max_columns = None
train_df.head(10)

Unnamed: 0,player,gw,position,minutes,team,opponent_team,relative_market_value_team,relative_market_value_opponent_team,was_home,total_points,assists,bonus,bps,clean_sheets,creativity,goals_conceded,goals_scored,ict_index,influence,own_goals,penalties_missed,penalties_saved,red_cards,saves,selected,team_a_score,team_h_score,threat,transfers_balance,transfers_in,transfers_out,yellow_cards,kickoff_time,season,play_proba,relative_market_value_team_season,relative_market_value_opponent_team_season
0,Aaron_Cresswell,1,2,0,West Ham United,Chelsea,,,False,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,14023,1,2,0.0,0,0,0,0,2016-08-15T19:00:00Z,1617,,0.895471,2.243698
1,Aaron_Lennon,1,3,15,Everton,Tottenham Hotspur,,,True,1,0,0,6,0,0.3,0,0,0.9,8.2,0,0,0,0,0,13918,1,1,0.0,0,0,0,0,2016-08-13T14:00:00Z,1617,,1.057509,1.43369
2,Aaron_Ramsey,1,3,60,Arsenal,Liverpool,,,True,2,0,0,5,0,4.9,3,0,3.0,2.2,0,0,0,0,0,163170,4,3,23.0,0,0,0,0,2016-08-14T15:00:00Z,1617,,1.944129,1.46586
3,Abdoulaye_Doucouré,1,3,0,Watford,Southampton,,,False,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,1051,1,1,0.0,0,0,0,0,2016-08-13T14:00:00Z,1617,,0.7042,0.796805
4,Abdul Rahman_Baba,1,2,0,Chelsea,West Ham United,,,True,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,1243,1,2,0.0,0,0,0,0,2016-08-15T19:00:00Z,1617,,2.243698,0.895471
5,Abel_Hernández,1,4,90,Hull City,Leicester City,,,True,5,1,0,10,0,12.2,1,0,5.7,14.4,0,0,0,0,0,26039,1,2,30.0,0,0,0,0,2016-08-13T11:30:00Z,1617,,0.494447,0.650832
6,Adama_Diomande,1,4,90,Hull City,Leicester City,,,True,8,0,2,29,0,16.8,1,1,10.7,45.2,0,0,0,0,0,38151,1,2,45.0,0,0,0,0,2016-08-13T11:30:00Z,1617,,0.494447,0.650832
7,Adam_Clayton,1,3,90,Middlesbrough,Stoke City,,,True,2,0,0,6,0,2.2,1,0,1.4,3.2,0,0,0,0,0,17663,1,1,9.0,0,0,0,0,2016-08-13T14:00:00Z,1617,,0.452793,0.718705
8,Adam_Federici,1,1,0,Bournemouth,Manchester United,,,True,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,4315,3,1,0.0,0,0,0,0,2016-08-14T12:30:00Z,1617,,0.384921,1.983179
9,Adam_Forshaw,1,3,69,Middlesbrough,Stoke City,,,True,1,0,0,3,0,1.3,1,0,0.3,2.0,0,0,0,0,0,2723,1,1,0.0,0,0,0,1,2016-08-13T14:00:00Z,1617,,0.452793,0.718705


Since this is a time series problem we have some functions that create various rolling totals and averages for players and teams

In [10]:
def player_lag_features(df, features, lags):    
    df_new = df.copy()
    player_lag_vars = []
    
    # need minutes for per game stats, add to front of list
    features.insert(0, 'minutes')

    # calculate totals for each lag period
    for feature in features:
        for lag in lags:
            feature_name = feature + '_last_' + str(lag)
            minute_name = 'minutes_last_' + str(lag)
            
            if lag == 'all':
                df_new[feature_name] = df_new.groupby(['player'])[feature].apply(lambda x: x.cumsum() - x)
            else: 
                df_new[feature_name] = df_new.groupby(['player'])[feature].apply(lambda x: x.rolling(min_periods=1, 
                                                                                            window=lag+1).sum() - x)
            if feature != 'minutes':

                pg_feature_name = feature + '_pg_last_' + str(lag)
                player_lag_vars.append(pg_feature_name)
                
                df_new[pg_feature_name] = 90 * df_new[feature_name] / df_new[minute_name]
                
                # some cases of -1 points and 0 minutes cause -inf values
                # change these to NaN
                df_new[pg_feature_name] = df_new[pg_feature_name].replace([np.inf, -np.inf], np.nan)
            
            else: player_lag_vars.append(minute_name)
                
    return df_new, player_lag_vars

In [69]:
# team level lag features
def team_lag_features(df, features, lags):
    team_lag_vars = []
    
    for feature in features:
        feature_team_name = feature + '_team'
        feature_conceded_team_name = feature_team_name + '_conceded'
        feature_team = (df.groupby(['team', 'season', 'gw',
                                   'kickoff_time', 'opponent_team'])
                        [feature].sum().rename(feature_team_name).reset_index())
        
        # join back for points conceded
        feature_team = feature_team.merge(feature_team,
                           left_on=['team', 'season', 'gw',
                                    'kickoff_time', 'opponent_team'],
                           right_on=['opponent_team', 'season', 'gw',
                                     'kickoff_time', 'team'],
                           how='left',
                           suffixes = ('', '_conceded'))
        
        feature_team.drop(['team_conceded', 'opponent_team_conceded'], axis=1, inplace=True)
                
        for lag in lags:
            feature_name = feature + '_team_last_' + str(lag)
            feature_conceded_name = feature + '_team_conceded_last_' + str(lag)
            pg_feature_name = feature + '_team_pg_last_' + str(lag)
            pg_feature_conceded_name = feature + '_team_conceded_pg_last_' + str(lag)
            
            team_lag_vars.extend([pg_feature_name, pg_feature_conceded_name])
            
            if lag == 'all':
                feature_team[feature_name] = (feature_team.groupby('team')[feature_team_name]
                                              .apply(lambda x: x.cumsum() - x))
                
                feature_team[feature_conceded_name] = (feature_team.groupby('team')[feature_conceded_team_name]
                                              .apply(lambda x: x.cumsum() - x))
                
                feature_team[pg_feature_name] = (feature_team[feature_name]
                                                 / feature_team.groupby('team').cumcount())
                
                feature_team[pg_feature_conceded_name] = (feature_team[feature_conceded_name]
                                                 / feature_team.groupby('team').cumcount())
                
            else:
                feature_team[feature_name] = (feature_team.groupby('team')[feature_team_name]
                                              .apply(lambda x: x.rolling(min_periods=1, 
                                                                         window=lag + 1).sum() - x))
                
                feature_team[feature_conceded_name] = (feature_team.groupby('team')[feature_conceded_team_name]
                                              .apply(lambda x: x.rolling(min_periods=1, 
                                                                         window=lag + 1).sum() - x))
                
                feature_team[pg_feature_name] = (feature_team[feature_name] / 
                                                 feature_team.groupby('team')[feature_team_name]
                                                 .apply(lambda x: x.rolling(min_periods=1, 
                                                                            window=lag + 1).count() - 1))
                
                feature_team[pg_feature_conceded_name] = (feature_team[feature_name] / 
                                                 feature_team.groupby('team')[feature_conceded_name]
                                                 .apply(lambda x: x.rolling(min_periods=1, 
                                                                            window=lag + 1).count() - 1))
        
        df_new = df.merge(feature_team, 
                          on=['team', 'season', 'gw', 'kickoff_time', 'opponent_team'], 
                          how='left')
        
        df_new = df_new.merge(feature_team,
                 left_on=['team', 'season', 'gw', 'kickoff_time', 'opponent_team'],
                 right_on=['opponent_team', 'season', 'gw', 'kickoff_time', 'team'],
                 how='left',
                 suffixes = ('', '_opponent'))
        
        team_lag_vars = team_lag_vars + [team_lag_var + '_opponent' for team_lag_var in team_lag_vars]
        
        df_new.drop(['team_opponent', 'opponent_team_opponent'], axis=1, inplace=True)
        
        return df_new, team_lag_vars

In [70]:
lag_train_df, team_lag_vars = team_lag_features(train_df, ['total_points'], ['all', 1, 2, 3])
# feature_team_test[(feature_team_test['team'].isin(['Arsenal', 'Leicester City'])) & (feature_team_test['season'] == '1617')].head(50)

In [73]:
team_lag_vars

['total_points_team_pg_last_all',
 'total_points_team_conceded_pg_last_all',
 'total_points_team_pg_last_1',
 'total_points_team_conceded_pg_last_1',
 'total_points_team_pg_last_2',
 'total_points_team_conceded_pg_last_2',
 'total_points_team_pg_last_3',
 'total_points_team_conceded_pg_last_3',
 'total_points_team_pg_last_all_opponent',
 'total_points_team_conceded_pg_last_all_opponent',
 'total_points_team_pg_last_1_opponent',
 'total_points_team_conceded_pg_last_1_opponent',
 'total_points_team_pg_last_2_opponent',
 'total_points_team_conceded_pg_last_2_opponent',
 'total_points_team_pg_last_3_opponent',
 'total_points_team_conceded_pg_last_3_opponent']

In [74]:
lag_train_df[lag_train_df['player'] == 'Kevin_De Bruyne']

Unnamed: 0,player,gw,position,minutes,team,opponent_team,relative_market_value_team,relative_market_value_opponent_team,was_home,total_points,assists,bonus,bps,clean_sheets,creativity,goals_conceded,goals_scored,ict_index,influence,own_goals,penalties_missed,penalties_saved,red_cards,saves,selected,team_a_score,team_h_score,threat,transfers_balance,transfers_in,transfers_out,yellow_cards,kickoff_time,season,play_proba,relative_market_value_team_season,relative_market_value_opponent_team_season,total_points_team,total_points_team_conceded,total_points_team_last_all,total_points_team_conceded_last_all,total_points_team_pg_last_all,total_points_team_conceded_pg_last_all,total_points_team_last_1,total_points_team_conceded_last_1,total_points_team_pg_last_1,total_points_team_conceded_pg_last_1,total_points_team_last_2,total_points_team_conceded_last_2,total_points_team_pg_last_2,total_points_team_conceded_pg_last_2,total_points_team_last_3,total_points_team_conceded_last_3,total_points_team_pg_last_3,total_points_team_conceded_pg_last_3,total_points_team_opponent,total_points_team_conceded_opponent,total_points_team_last_all_opponent,total_points_team_conceded_last_all_opponent,total_points_team_pg_last_all_opponent,total_points_team_conceded_pg_last_all_opponent,total_points_team_last_1_opponent,total_points_team_conceded_last_1_opponent,total_points_team_pg_last_1_opponent,total_points_team_conceded_pg_last_1_opponent,total_points_team_last_2_opponent,total_points_team_conceded_last_2_opponent,total_points_team_pg_last_2_opponent,total_points_team_conceded_pg_last_2_opponent,total_points_team_last_3_opponent,total_points_team_conceded_last_3_opponent,total_points_team_pg_last_3_opponent,total_points_team_conceded_pg_last_3_opponent
297,Kevin_De Bruyne,1,3,90,Manchester City,Sunderland,,,True,2,0,0,6,0,25.9,1,0,5.2,3.2,0,0,0,0,0,176498,1,2,23.0,0,0,0,0,2016-08-13T16:30:00Z,1617,,2.311012,0.418392,37,26.0,0,0.0,,,0.0,0.0,,,0.0,0.0,,,0.0,0.0,,,26.0,37.0,0.0,0.0,,,0.0,0.0,,,0.0,0.0,,,0.0,0.0,,
826,Kevin_De Bruyne,2,3,87,Manchester City,Stoke City,,,False,4,1,0,19,0,51.8,1,0,8.5,21.2,0,0,0,0,0,199367,4,1,12.0,-7066,6203,13269,1,2016-08-20T11:30:00Z,1617,,2.311012,0.718705,57,19.0,37,26.0,37.000000,26.000000,37.0,26.0,37.0,37.0,37.0,26.0,37.0,37.0,37.0,26.0,37.000000,37.000000,19.0,57.0,28.0,31.0,28.000000,31.000000,28.0,31.0,28.0,28.0,28.0,31.0,28.0,28.0,28.0,31.0,28.000000,28.000000
1371,Kevin_De Bruyne,3,3,90,Manchester City,West Ham United,,,True,6,1,1,31,0,63.0,1,0,11.1,26.6,0,0,0,0,0,202158,1,3,21.0,-10163,10864,21027,0,2016-08-28T15:00:00Z,1617,,2.311012,0.895471,53,24.0,94,45.0,47.000000,22.500000,57.0,19.0,57.0,57.0,94.0,45.0,47.0,47.0,94.0,45.0,47.000000,47.000000,24.0,53.0,91.0,61.0,45.500000,30.500000,62.0,22.0,62.0,62.0,91.0,61.0,45.5,45.5,91.0,61.0,45.500000,45.500000
1935,Kevin_De Bruyne,4,3,89,Manchester City,Manchester United,,,False,13,1,3,47,0,75.6,1,1,16.9,48.0,0,0,0,0,0,202166,2,1,45.0,-9429,11646,21075,0,2016-09-10T11:30:00Z,1617,,2.311012,1.983179,42,19.0,147,69.0,49.000000,23.000000,53.0,24.0,53.0,53.0,110.0,43.0,55.0,55.0,147.0,69.0,49.000000,49.000000,19.0,42.0,177.0,74.0,59.000000,24.666667,57.0,23.0,57.0,57.0,127.0,44.0,63.5,63.5,177.0,74.0,59.000000,59.000000
2517,Kevin_De Bruyne,5,3,74,Manchester City,Bournemouth,,,True,14,1,3,57,1,59.0,0,1,17.3,71.6,0,0,0,0,0,372086,0,4,42.0,152780,160832,8052,0,2016-09-17T14:00:00Z,1617,,2.311012,0.384921,78,15.0,189,88.0,47.250000,22.000000,42.0,19.0,42.0,42.0,95.0,43.0,47.5,47.5,152.0,62.0,50.666667,50.666667,15.0,78.0,154.0,170.0,38.500000,42.500000,62.0,24.0,62.0,62.0,102.0,58.0,51.0,51.0,124.0,120.0,41.333333,41.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88178,Kevin_De Bruyne,35,3,63,Manchester City,Brighton and Hove Albion,2.617526,0.479304,False,3,0,0,19,1,62.0,0,0,8.1,16.2,0,0,0,0,0,3979358,5,0,3.0,3349,48252,44903,0,2020-07-11T19:00:00Z,1920,1.0,2.727025,0.476156,94,14.0,8146,4120.0,55.040541,27.837838,83.0,12.0,83.0,83.0,104.0,75.0,52.0,52.0,186.0,88.0,62.000000,62.000000,14.0,94.0,3905.0,5033.0,35.500000,45.754545,28.0,52.0,28.0,28.0,91.0,76.0,45.5,45.5,111.0,155.0,37.000000,37.000000
88837,Kevin_De Bruyne,36,3,0,Manchester City,Bournemouth,2.448707,0.615026,True,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,3885423,1,2,0.0,-55768,36954,92722,0,2020-07-15T17:00:00Z,1920,1.0,2.727025,0.687124,46,30.0,8240,4134.0,55.302013,27.744966,94.0,14.0,94.0,94.0,177.0,26.0,88.5,88.5,198.0,89.0,66.000000,66.000000,30.0,46.0,5552.0,6592.0,37.261745,44.241611,50.0,11.0,50.0,50.0,98.0,59.0,49.0,49.0,127.0,124.0,42.333333,42.333333
89500,Kevin_De Bruyne,37,3,90,Manchester City,Watford,2.468858,0.504811,False,6,1,0,32,1,83.1,0,0,16.0,39.4,0,0,0,0,0,3791088,4,0,37.0,-121257,25174,146431,0,2020-07-21T17:00:00Z,1920,1.0,2.727025,0.555819,83,22.0,8286,4164.0,55.240000,27.760000,46.0,30.0,46.0,46.0,140.0,44.0,70.0,70.0,223.0,56.0,74.333333,74.333333,22.0,83.0,5343.0,6850.0,35.620000,45.666667,27.0,53.0,27.0,27.0,69.0,80.0,34.5,34.5,111.0,106.0,37.000000,37.000000
90166,Kevin_De Bruyne,38,3,90,Manchester City,Norwich,2.430397,0.327574,True,19,1,3,70,1,133.5,0,2,30.1,113.2,0,0,0,0,0,3658668,0,5,54.0,-97450,45469,142919,0,2020-07-26T15:00:00Z,1920,1.0,2.727025,0.198300,81,17.0,8369,4186.0,55.423841,27.721854,83.0,22.0,83.0,83.0,129.0,52.0,64.5,64.5,223.0,66.0,74.333333,74.333333,17.0,81.0,1123.0,1958.0,30.351351,52.918919,13.0,62.0,13.0,13.0,39.0,124.0,19.5,19.5,56.0,213.0,18.666667,18.666667


In [75]:
lag_train_df[lag_train_df['player'] == 'Antonio_Valencia']

Unnamed: 0,player,gw,position,minutes,team,opponent_team,relative_market_value_team,relative_market_value_opponent_team,was_home,total_points,assists,bonus,bps,clean_sheets,creativity,goals_conceded,goals_scored,ict_index,influence,own_goals,penalties_missed,penalties_saved,red_cards,saves,selected,team_a_score,team_h_score,threat,transfers_balance,transfers_in,transfers_out,yellow_cards,kickoff_time,season,play_proba,relative_market_value_team_season,relative_market_value_opponent_team_season,total_points_team,total_points_team_conceded,total_points_team_last_all,total_points_team_conceded_last_all,total_points_team_pg_last_all,total_points_team_conceded_pg_last_all,total_points_team_last_1,total_points_team_conceded_last_1,total_points_team_pg_last_1,total_points_team_conceded_pg_last_1,total_points_team_last_2,total_points_team_conceded_last_2,total_points_team_pg_last_2,total_points_team_conceded_pg_last_2,total_points_team_last_3,total_points_team_conceded_last_3,total_points_team_pg_last_3,total_points_team_conceded_pg_last_3,total_points_team_opponent,total_points_team_conceded_opponent,total_points_team_last_all_opponent,total_points_team_conceded_last_all_opponent,total_points_team_pg_last_all_opponent,total_points_team_conceded_pg_last_all_opponent,total_points_team_last_1_opponent,total_points_team_conceded_last_1_opponent,total_points_team_pg_last_1_opponent,total_points_team_conceded_pg_last_1_opponent,total_points_team_last_2_opponent,total_points_team_conceded_last_2_opponent,total_points_team_pg_last_2_opponent,total_points_team_conceded_pg_last_2_opponent,total_points_team_last_3_opponent,total_points_team_conceded_last_3_opponent,total_points_team_pg_last_3_opponent,total_points_team_conceded_pg_last_3_opponent
46,Antonio_Valencia,1,2,90,Manchester United,Bournemouth,,,False,2,0,0,12,0,18.3,1,0,4.7,22.8,0,0,0,0,0,291254,3,1,6.0,0,0,0,0,2016-08-14T12:30:00Z,1617,,1.983179,0.384921,50,30.0,0,0.0,,,0.0,0.0,,,0.0,0.0,,,0.0,0.0,,,30.0,50.0,0.0,0.0,,,0.0,0.0,,,0.0,0.0,,,0.0,0.0,,
571,Antonio_Valencia,2,2,90,Manchester United,Southampton,,,True,6,0,0,25,1,7.5,0,0,2.8,16.2,0,0,0,0,0,340941,0,2,4.0,13032,23352,10320,0,2016-08-19T19:00:00Z,1617,,1.983179,0.796805,70,21.0,50,30.0,50.000000,30.000000,50.0,30.0,50.0,50.0,50.0,30.0,50.0,50.0,50.0,30.0,50.000000,50.000000,21.0,70.0,31.0,34.0,31.000000,34.000000,31.0,34.0,31.0,31.0,31.0,34.0,31.0,31.0,31.0,34.0,31.000000,31.000000
1108,Antonio_Valencia,3,2,90,Manchester United,Hull City,,,False,9,0,3,38,1,50.6,0,0,10.1,24.0,0,0,0,0,0,371930,1,0,26.0,13410,31224,17814,0,2016-08-27T16:30:00Z,1617,,1.983179,0.494447,57,23.0,120,51.0,60.000000,25.500000,70.0,21.0,70.0,70.0,120.0,51.0,60.0,60.0,120.0,51.0,60.000000,60.000000,23.0,57.0,101.0,48.0,50.500000,24.000000,63.0,20.0,63.0,63.0,101.0,48.0,50.5,50.5,101.0,48.0,50.500000,50.500000
1660,Antonio_Valencia,4,2,90,Manchester United,Manchester City,,,True,1,0,0,13,0,6.7,2,0,2.9,20.6,0,0,0,0,0,488199,2,1,2.0,84199,109077,24878,0,2016-09-10T11:30:00Z,1617,,1.983179,2.311012,19,42.0,177,74.0,59.000000,24.666667,57.0,23.0,57.0,57.0,127.0,44.0,63.5,63.5,177.0,74.0,59.000000,59.000000,42.0,19.0,147.0,69.0,49.000000,23.000000,53.0,24.0,53.0,53.0,110.0,43.0,55.0,55.0,147.0,69.0,49.000000,49.000000
2241,Antonio_Valencia,5,2,61,Manchester United,Watford,,,False,2,0,0,13,0,6.3,1,0,2.5,15.0,0,0,0,0,0,514792,1,3,4.0,14748,29513,14765,0,2016-09-18T11:00:00Z,1617,,1.983179,0.704200,23,51.0,196,116.0,49.000000,29.000000,19.0,42.0,19.0,19.0,76.0,65.0,38.0,38.0,146.0,86.0,48.666667,48.666667,51.0,23.0,131.0,156.0,32.750000,39.000000,50.0,32.0,50.0,50.0,68.0,88.0,34.0,34.0,97.0,125.0,32.333333,32.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65302,Antonio_Valencia,35,2,0,Manchester United,Everton,,,False,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,41295,0,4,0.0,-154,62,216,0,2019-04-21T12:30:00Z,1819,,2.015531,1.039221,13,83.0,5212,3700.0,47.816514,33.944954,43.0,32.0,43.0,43.0,64.0,72.0,32.0,32.0,109.0,102.0,36.333333,36.333333,83.0,13.0,4608.0,4425.0,41.890909,40.227273,20.0,74.0,20.0,20.0,84.0,94.0,42.0,42.0,156.0,112.0,52.000000,52.000000
65303,Antonio_Valencia,35,2,0,Manchester United,Manchester City,,,True,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,41295,2,0,0.0,-154,62,216,0,2019-04-24T19:00:00Z,1819,,2.015531,2.540586,18,66.0,5225,3783.0,47.500000,34.390909,13.0,83.0,13.0,13.0,56.0,115.0,28.0,28.0,77.0,155.0,25.666667,25.666667,66.0,18.0,6105.0,2962.0,55.500000,26.927273,60.0,24.0,60.0,60.0,114.0,54.0,57.0,57.0,185.0,73.0,61.666667,61.666667
66140,Antonio_Valencia,36,2,0,Manchester United,Chelsea,,,True,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,41234,1,1,0.0,-110,24,134,0,2019-04-28T15:30:00Z,1819,,2.015531,2.540586,35,35.0,5243,3849.0,47.234234,34.675676,18.0,66.0,18.0,18.0,31.0,149.0,15.5,15.5,74.0,181.0,24.666667,24.666667,35.0,35.0,5641.0,3671.0,50.819820,33.072072,34.0,33.0,34.0,34.0,52.0,106.0,26.0,26.0,122.0,123.0,40.666667,40.666667
66755,Antonio_Valencia,37,2,0,Manchester United,Huddersfield Town,,,False,0,0,0,0,0,0.0,0,0,0.0,0.0,0,0,0,0,0,41199,1,1,0.0,-91,32,123,0,2019-05-05T13:00:00Z,1819,,2.015531,0.273778,30,35.0,5278,3884.0,47.125000,34.678571,35.0,35.0,35.0,35.0,53.0,101.0,26.5,26.5,66.0,184.0,22.000000,22.000000,35.0,30.0,2298.0,3893.0,31.054054,52.608108,12.0,96.0,12.0,12.0,39.0,138.0,19.5,19.5,50.0,227.0,16.666667,16.666667


For example, the following creates totals and per game (per 90 mins) averages for points going back 1, 2, 3, 4, 5, 10, 20 and all previous weeks. This is done at both player and team level.

This has been checked for points totals, but should also work for any other stat such as player/team goals scored, assists, player goals conceded. However, it will not currently work for team level stats such as team goals conceded where adding up the goals conceded across all the team players would be incorrect.

In [12]:
# create some lag features
lag_train_df, team_lag_vars = team_lag_features(train_df, ['total_points'], ['all', 1, 2, 3, 4, 5, 10, 20])
lag_train_df, player_lag_vars = player_lag_features(lag_train_df, ['total_points'], ['all', 1, 2, 3, 4, 5, 10, 20])

You can see below that the player's (Salah) historic point totals and per game totals are given, as well as the totals for his team (Liverpool) and whichever team he is playing in that gameweek (e.g. his debut was versus Watord on the 12th August 2017, so Watford's running point totals and per game totals are also given).

Note that if it is the first game since the start of the 2016/17 season for the team or opposition, then the point totals for previous games will be 0 and the per game totals will be null. If the player has not had any minutes in the previous number of games being calculated, again
the point totals will also be 0, and per game totals null.

In [24]:
lag_train_df.shape

(91401, 95)

In [13]:
# look at resulting dataset for a player
lag_train_df[lag_train_df['player'] == 'Mohamed_Salah']

Unnamed: 0,player,gw,position,minutes,team,opponent_team,relative_market_value_team,relative_market_value_opponent_team,was_home,total_points,assists,bonus,bps,clean_sheets,creativity,goals_conceded,goals_scored,ict_index,influence,own_goals,penalties_missed,penalties_saved,red_cards,saves,selected,team_a_score,team_h_score,threat,transfers_balance,transfers_in,transfers_out,yellow_cards,kickoff_time,season,play_proba,relative_market_value_team_season,relative_market_value_opponent_team_season,total_points_team,total_points_team_last_all,total_points_team_pg_last_all,total_points_team_last_1,total_points_team_pg_last_1,total_points_team_last_2,total_points_team_pg_last_2,total_points_team_last_3,total_points_team_pg_last_3,total_points_team_last_4,total_points_team_pg_last_4,total_points_team_last_5,total_points_team_pg_last_5,total_points_team_last_10,total_points_team_pg_last_10,total_points_team_last_20,total_points_team_pg_last_20,total_points_team_opponent,total_points_team_last_all_opponent,total_points_team_pg_last_all_opponent,total_points_team_last_1_opponent,total_points_team_pg_last_1_opponent,total_points_team_last_2_opponent,total_points_team_pg_last_2_opponent,total_points_team_last_3_opponent,total_points_team_pg_last_3_opponent,total_points_team_last_4_opponent,total_points_team_pg_last_4_opponent,total_points_team_last_5_opponent,total_points_team_pg_last_5_opponent,total_points_team_last_10_opponent,total_points_team_pg_last_10_opponent,total_points_team_last_20_opponent,total_points_team_pg_last_20_opponent,minutes_last_all,minutes_last_1,minutes_last_2,minutes_last_3,minutes_last_4,minutes_last_5,minutes_last_10,minutes_last_20,total_points_last_all,total_points_pg_last_all,total_points_last_1,total_points_pg_last_1,total_points_last_2,total_points_pg_last_2,total_points_last_3,total_points_pg_last_3,total_points_last_4,total_points_pg_last_4,total_points_last_5,total_points_pg_last_5,total_points_last_10,total_points_pg_last_10,total_points_last_20,total_points_pg_last_20
24036,Mohamed_Salah,1,3,85,Liverpool,Watford,,,False,11,1,1,26,0,2.8,2,1,8.2,24.6,0,0,0,0,0,874608,3,3,55.0,0,0,0,0,2017-08-12T11:30:00Z,1718,,1.619155,0.547242,44,1863,49.026316,78.0,78.0,159.0,79.5,206.0,68.666667,266.0,66.50,292.0,58.4,510.0,51.0,916.0,45.80,43.0,1264.0,33.263158,15.0,15.0,41.0,20.5,63.0,21.000000,84.0,21.00,108.0,21.6,319.0,31.9,633.0,31.65,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,
24551,Mohamed_Salah,2,3,29,Liverpool,Crystal Palace,,,True,1,0,0,0,0,12.3,0,0,5.1,10.4,0,0,0,0,0,1293309,0,1,28.0,175914,193660,17746,0,2017-08-19T14:00:00Z,1718,,1.619155,0.635984,60,1907,48.897436,44.0,44.0,122.0,61.0,203.0,67.666667,250.0,62.50,310.0,62.0,522.0,52.2,901.0,45.05,29.0,1427.0,36.589744,15.0,15.0,34.0,17.0,116.0,38.666667,129.0,32.25,146.0,29.2,356.0,35.6,769.0,38.45,85,85.0,85.0,85.0,85.0,85.0,85.0,85.0,11,11.647059,11.0,11.647059,11.0,11.647059,11.0,11.647059,11.0,11.647059,11.0,11.647059,11.0,11.647059,11.0,11.647059
25076,Mohamed_Salah,3,3,90,Liverpool,Arsenal,,,True,11,1,0,39,1,25.3,0,1,19.9,70.4,0,0,0,0,0,1158692,0,4,103.0,-184736,27792,212528,0,2017-08-27T15:00:00Z,1718,,1.619155,2.073500,81,1967,49.175000,60.0,60.0,104.0,52.0,182.0,60.666667,263.0,65.75,310.0,62.0,530.0,53.0,931.0,46.55,12.0,1956.0,48.900000,26.0,26.0,78.0,39.0,128.0,42.666667,198.0,49.50,258.0,51.6,526.0,52.6,958.0,47.90,114,29.0,114.0,114.0,114.0,114.0,114.0,114.0,12,9.473684,1.0,3.103448,12.0,9.473684,12.0,9.473684,12.0,9.473684,12.0,9.473684,12.0,9.473684,12.0,9.473684
25614,Mohamed_Salah,4,3,45,Liverpool,Manchester City,,,False,1,0,0,4,0,13.8,1,0,4.7,7.8,0,0,0,0,0,1422941,0,5,25.0,177596,238283,60687,0,2017-09-09T11:30:00Z,1718,,1.619155,2.016093,8,2048,49.951220,81.0,81.0,141.0,70.5,185.0,61.666667,263.0,65.75,344.0,68.8,577.0,57.7,982.0,49.10,88.0,1985.0,48.414634,39.0,39.0,63.0,31.5,130.0,43.333333,220.0,55.00,279.0,55.8,578.0,57.8,1080.0,54.00,204,90.0,119.0,204.0,204.0,204.0,204.0,204.0,23,10.147059,11.0,11.000000,12.0,9.075630,23.0,10.147059,23.0,10.147059,23.0,10.147059,23.0,10.147059,23.0,10.147059
26160,Mohamed_Salah,5,3,90,Liverpool,Burnley,,,True,10,0,3,27,0,35.8,1,1,17.4,51.2,0,0,0,0,0,1571656,1,1,87.0,122769,224328,101559,0,2017-09-16T14:00:00Z,1718,,1.619155,0.316798,34,2056,48.952381,8.0,8.0,89.0,44.5,149.0,49.666667,193.0,48.25,271.0,54.2,545.0,54.5,952.0,47.60,37.0,1540.0,36.666667,53.0,53.0,88.0,44.0,111.0,37.000000,154.0,38.50,181.0,36.2,347.0,34.7,736.0,36.80,249,45.0,135.0,164.0,249.0,249.0,249.0,249.0,24,8.674699,1.0,2.000000,12.0,8.000000,13.0,7.134146,24.0,8.674699,24.0,8.674699,24.0,8.674699,24.0,8.674699
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88528,Mohamed_Salah,36,3,82,Liverpool,Arsenal,2.458682,1.500845,False,2,0,0,-3,0,3.3,2,0,9.1,7.4,0,0,0,0,0,2609146,1,2,80.0,-111419,24828,136247,0,2020-07-15T19:15:00Z,1920,1.0,2.297572,1.448866,31,8120,54.496644,43.0,43.0,95.0,47.5,169.0,56.333333,182.0,45.50,273.0,54.6,494.0,49.4,1191.0,59.55,38.0,6792.0,45.583893,27.0,27.0,61.0,30.5,127.0,42.333333,201.0,50.25,268.0,53.6,508.0,50.8,891.0,44.55,8852,90.0,180.0,270.0,360.0,450.0,810.0,1683.0,787,8.001582,2.0,2.000000,20.0,10.000000,26.0,8.666667,28.0,7.000000,39.0,7.800000,60.0,6.666667,150.0,8.021390
89191,Mohamed_Salah,37,3,78,Liverpool,Chelsea,2.451453,2.064820,True,5,1,0,15,0,16.8,3,0,9.5,25.6,0,0,0,0,0,2559844,3,5,53.0,-69649,24957,94606,0,2020-07-22T19:15:00Z,1920,1.0,2.297572,1.798870,64,8151,54.340000,31.0,31.0,74.0,37.0,126.0,42.000000,200.0,50.00,213.0,42.6,466.0,46.6,1148.0,57.40,33.0,7388.0,49.253333,62.0,62.0,82.0,41.0,129.0,43.000000,210.0,52.50,247.0,49.4,500.0,50.0,890.0,44.50,8934,82.0,172.0,262.0,352.0,442.0,802.0,1675.0,789,7.948287,2.0,2.195122,4.0,2.093023,22.0,7.557252,28.0,7.159091,30.0,6.108597,59.0,6.620948,139.0,7.468657
89856,Mohamed_Salah,38,3,26,Liverpool,Newcastle United,2.406333,0.611449,False,1,0,0,2,0,24.0,0,0,6.0,1.6,0,0,0,0,0,2457006,3,1,34.0,-102494,31724,134218,0,2020-07-26T15:00:00Z,1920,1.0,2.297572,0.542356,54,8215,54.403974,64.0,64.0,95.0,47.5,138.0,46.000000,190.0,47.50,264.0,52.8,482.0,48.2,1147.0,57.35,30.0,4373.0,38.699115,52.0,52.0,77.0,38.5,104.0,34.666667,116.0,29.00,157.0,31.4,456.0,45.6,802.0,40.10,9012,78.0,160.0,250.0,340.0,430.0,790.0,1663.0,794,7.929427,5.0,5.769231,7.0,3.937500,9.0,3.240000,27.0,7.147059,33.0,6.906977,57.0,6.493671,128.0,6.927240
90738,Mohamed_Salah,1,3,90,Liverpool,Leeds,2.394822,0.300409,True,20,0,3,69,0,50.1,3,3,32.8,117.2,0,0,0,0,0,1883241,3,4,161.0,0,0,0,0,2020-09-12T16:30:00Z,2021,1.0,2.394822,0.300409,48,8269,54.401316,54.0,54.0,118.0,59.0,149.0,49.666667,192.0,48.00,244.0,48.8,516.0,51.6,1115.0,55.75,42.0,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,9038,26.0,104.0,186.0,276.0,366.0,726.0,1620.0,795,7.916574,1.0,3.461538,6.0,5.192308,8.0,3.870968,10.0,3.260870,28.0,6.885246,56.0,6.942149,126.0,7.000000


In [22]:
# look at resulting dataset for a player
lag_train_df[(lag_train_df['player'] == 'Héctor_Bellerín') & (lag_train_df['season'] == '1718')].head()

Unnamed: 0,player,gw,position,minutes,team,opponent_team,relative_market_value_team,relative_market_value_opponent_team,was_home,total_points,assists,bonus,bps,clean_sheets,creativity,goals_conceded,goals_scored,ict_index,influence,own_goals,penalties_missed,penalties_saved,red_cards,saves,selected,team_a_score,team_h_score,threat,transfers_balance,transfers_in,transfers_out,yellow_cards,kickoff_time,season,play_proba,relative_market_value_team_season,relative_market_value_opponent_team_season,total_points_team,total_points_team_last_all,total_points_team_pg_last_all,total_points_team_last_1,total_points_team_pg_last_1,total_points_team_last_2,total_points_team_pg_last_2,total_points_team_last_3,total_points_team_pg_last_3,total_points_team_last_4,total_points_team_pg_last_4,total_points_team_last_5,total_points_team_pg_last_5,total_points_team_last_10,total_points_team_pg_last_10,total_points_team_last_20,total_points_team_pg_last_20,total_points_team_opponent,total_points_team_last_all_opponent,total_points_team_pg_last_all_opponent,total_points_team_last_1_opponent,total_points_team_pg_last_1_opponent,total_points_team_last_2_opponent,total_points_team_pg_last_2_opponent,total_points_team_last_3_opponent,total_points_team_pg_last_3_opponent,total_points_team_last_4_opponent,total_points_team_pg_last_4_opponent,total_points_team_last_5_opponent,total_points_team_pg_last_5_opponent,total_points_team_last_10_opponent,total_points_team_pg_last_10_opponent,total_points_team_last_20_opponent,total_points_team_pg_last_20_opponent,minutes_last_all,minutes_last_1,minutes_last_2,minutes_last_3,minutes_last_4,minutes_last_5,minutes_last_10,minutes_last_20,total_points_last_all,total_points_pg_last_all,total_points_last_1,total_points_pg_last_1,total_points_last_2,total_points_pg_last_2,total_points_last_3,total_points_pg_last_3,total_points_last_4,total_points_pg_last_4,total_points_last_5,total_points_pg_last_5,total_points_last_10,total_points_pg_last_10,total_points_last_20,total_points_pg_last_20
23870,Héctor_Bellerín,1,2,90,Arsenal,Leicester City,,,True,1,0,0,9,0,27.5,3,0,7.0,17.2,0,0,0,0,0,572986,3,4,25.0,0,0,0,0,2017-08-11T18:45:00Z,1718,,2.0735,0.824624,52,1878,49.421053,50.0,50.0,120.0,60.0,180.0,60.0,251.0,62.75,323.0,64.6,540.0,54.0,991.0,49.55,40.0,1392.0,36.631579,30.0,30.0,49.0,24.5,73.0,24.333333,151.0,37.75,212.0,42.4,440.0,44.0,789.0,39.45,2503,90.0,180.0,270.0,325.0,332.0,619.0,1177.0,119,4.278865,8.0,8.0,13.0,6.5,22.0,7.333333,22.0,6.092308,23.0,6.23494,38.0,5.52504,57.0,4.358539
24383,Héctor_Bellerín,2,2,90,Arsenal,Stoke City,,,False,2,0,0,8,0,12.8,1,0,5.6,7.4,0,0,0,0,0,628098,0,1,36.0,-28590,14368,42958,0,2017-08-19T16:30:00Z,1718,,2.0735,0.581587,26,1930,49.487179,52.0,52.0,102.0,51.0,172.0,57.333333,232.0,58.0,303.0,60.6,518.0,51.8,975.0,48.75,65.0,1439.0,36.897436,24.0,24.0,83.0,41.5,101.0,33.666667,129.0,32.25,182.0,36.4,315.0,31.5,733.0,36.65,2593,90.0,180.0,270.0,360.0,415.0,619.0,1177.0,120,4.16506,1.0,1.0,9.0,4.5,14.0,4.666667,23.0,5.75,23.0,4.987952,34.0,4.943457,50.0,3.82328
24906,Héctor_Bellerín,3,2,90,Arsenal,Liverpool,,,False,0,0,0,10,0,2.9,4,0,0.6,1.0,0,0,0,0,0,579314,0,4,2.0,-67892,7781,75673,0,2017-08-27T15:00:00Z,1718,,2.0735,1.619155,12,1956,48.9,26.0,26.0,78.0,39.0,128.0,42.666667,198.0,49.5,258.0,51.6,526.0,52.6,958.0,47.9,81.0,1967.0,49.175,60.0,60.0,104.0,52.0,182.0,60.666667,263.0,65.75,310.0,62.0,530.0,53.0,931.0,46.55,2683,90.0,180.0,270.0,360.0,450.0,619.0,1177.0,122,4.092434,2.0,2.0,3.0,1.5,11.0,3.666667,16.0,4.0,25.0,5.0,35.0,5.088853,52.0,3.976211
25441,Héctor_Bellerín,4,2,90,Arsenal,Bournemouth,,,True,6,0,0,26,1,28.3,0,0,4.7,12.4,0,0,0,0,0,533242,0,3,6.0,-59067,13066,72133,0,2017-09-09T14:00:00Z,1718,,2.0735,0.379765,79,1968,48.0,12.0,12.0,38.0,19.0,90.0,30.0,140.0,35.0,210.0,42.0,495.0,49.5,895.0,44.75,18.0,1613.0,39.341463,21.0,21.0,39.0,19.5,64.0,21.333333,100.0,25.0,148.0,29.6,372.0,37.2,747.0,37.35,2773,90.0,180.0,270.0,360.0,450.0,708.0,1267.0,122,3.959611,0.0,0.0,2.0,1.0,3.0,1.0,11.0,2.75,16.0,3.2,34.0,4.322034,52.0,3.693765
25987,Héctor_Bellerín,5,2,90,Arsenal,Chelsea,,,False,5,0,0,27,1,29.8,0,0,4.3,11.4,0,0,0,0,0,508976,0,0,2.0,-30945,3865,34810,1,2017-09-17T12:30:00Z,1718,,2.0735,2.125018,57,2047,48.738095,79.0,79.0,91.0,45.5,117.0,39.0,169.0,42.25,219.0,43.8,512.0,51.2,928.0,46.4,48.0,2251.0,53.595238,46.0,46.0,119.0,59.5,164.0,54.666667,190.0,47.5,259.0,51.8,590.0,59.0,986.0,49.3,2863,90.0,180.0,270.0,360.0,450.0,708.0,1355.0,128,4.023751,6.0,6.0,6.0,3.0,8.0,2.666667,9.0,2.25,17.0,3.4,33.0,4.194915,57.0,3.785978


In [14]:
# summary with lag features added
lag_train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91401 entries, 0 to 91400
Data columns (total 95 columns):
player                                        91401 non-null object
gw                                            91401 non-null int64
position                                      91401 non-null int64
minutes                                       91401 non-null int64
team                                          91401 non-null object
opponent_team                                 91401 non-null object
relative_market_value_team                    23465 non-null float64
relative_market_value_opponent_team           23465 non-null float64
was_home                                      91401 non-null bool
total_points                                  91401 non-null int64
assists                                       91401 non-null int64
bonus                                         91401 non-null int64
bps                                           91401 non-null int64
clean_sheets  

We also return a lists for the player lag and team lag variables which will help us when modelling. For now these only have the <i>per game</i> calculated values and minutes (for players).

In [11]:
player_lag_vars

['minutes_last_all',
 'minutes_last_1',
 'minutes_last_2',
 'minutes_last_3',
 'minutes_last_4',
 'minutes_last_5',
 'minutes_last_10',
 'minutes_last_20',
 'total_points_pg_last_all',
 'total_points_pg_last_1',
 'total_points_pg_last_2',
 'total_points_pg_last_3',
 'total_points_pg_last_4',
 'total_points_pg_last_5',
 'total_points_pg_last_10',
 'total_points_pg_last_20']

In [12]:
team_lag_vars

['total_points_team_pg_last_all',
 'total_points_team_pg_last_1',
 'total_points_team_pg_last_2',
 'total_points_team_pg_last_3',
 'total_points_team_pg_last_4',
 'total_points_team_pg_last_5',
 'total_points_team_pg_last_10',
 'total_points_team_pg_last_20',
 'total_points_team_pg_last_all_opponent',
 'total_points_team_pg_last_1_opponent',
 'total_points_team_pg_last_2_opponent',
 'total_points_team_pg_last_3_opponent',
 'total_points_team_pg_last_4_opponent',
 'total_points_team_pg_last_5_opponent',
 'total_points_team_pg_last_10_opponent',
 'total_points_team_pg_last_20_opponent']

Now we have an easy way of getting the points per game total for any player at any point in time. Here is Salah at the last gameweek in the 2019/20 season.

In [13]:
lag_train_df[(lag_train_df['season'] == '1920') & 
             (lag_train_df['gw'] == 38) & 
             (lag_train_df['player'] == 'Mohamed_Salah')]['total_points_pg_last_all'].mean()

7.9294274300932095

Here is a check that summing up and dividing all points and minutes to that point in time gives the same answer.

In [14]:
(train_df[:89771][train_df[:89771]['player'] == 'Mohamed_Salah']['total_points'].sum() * 90 
 / train_df[:89771][train_df[:89771]['player'] == 'Mohamed_Salah']['minutes'].sum())

7.9294274300932095

And we can use the same approach to see the average points per game (per 90 minutes) across all players. We'll use this in the simple baseline model.

In [15]:
# points per minute across all players and minutes
(train_df['total_points'].sum() * 90 / train_df['minutes'].sum())

3.7388996509104

But need to be somewhat aware that players with appearances with predominantly low number of minutes may have artificially high point per minute values due to the fact that they will get at least 1 point over 1-10 minutes of time

In [16]:
# extreme example of points per minute for all appearances under 10 minutes
(train_df[train_df['minutes'] < 10]['total_points'].sum() * 90 / train_df[train_df['minutes'] < 10]['minutes'].sum())

22.189395937747296

The performance of any model may vary across the season. For example, performance may be worse at the start of the season due to new players / teams, and big changes within existing teams that aren't captured in the historical data. Also, we will be generating forecasts at any one point for the remainder of the season - if we're at gameweek 1, then a forecast for gameweek 2 is likely to be more accurate than for gaemweek 10.

We therefore need a sensible way to combine these different situations into our validation of models.

The standard way to do this with a time series problem is assess the model on a sequence of time steps. In our case, we will do this for the most recent complete season, starting at gameweek 1 and moving through to the end of the season. In FPL we are also generally more concerned with the near future, so we'll only assess the performance of the next 6 fixtures.

Here's how the validation looks in practice:

1. Train using all data up to but not including gw 1; use model to predict gw 1-6; calculate rmse for gw 1-6 predictions
2. Train using all data up to but not including gw 2; use model to predict gw 2-7; calculate rmse for gw 2-7 predictions
3. Train using all data up to but not including gw 3; use model to predict gw 3-8; calculate rmse for gw 3-8 predictions

.. repeat until...

33. Train using all data up to but not including gw 3; use model to predict gw 33-38; calculate rmse for gw 33-38 predictions

We can then look at how the performance varies across the validation season, as well as averaging performance across all weeks to give us a single validation number for each model.

It will be helpful to have a function that returns indexes for the start and end of a validation periods, given a season, gameweek and length of validation.

In [17]:
# validation set indexes
# training will always be from start of data up to valid-start
def validation_gw_idx(df, season, gw, length):
    
    valid_start = df[(df['gw'] == gw) & (df['season'] == season)].index.min()
    valid_end = df[(df['gw'] == min(gw+length-1, 38)) & (df['season'] == season)].index.max()

    return (valid_start, valid_end)

In [18]:
# try it
validation_gw_idx(train_df, '1920', 1, 6)

(67936, 71131)

Now that we have a good sense of the dataset, created a few extra time-series features, and decided on a validation approach, it's time to create a simple baseline model in the next notebook.

Note: We want to use the above functions in subsequent notebooks, so to avoid having to write them out again they have been added to the helper.py module, and can be imported into any notebook by running:

```from helpers import *```

This is the case for all functions in this series of notebooks.