In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
# path to project directory
path = Path('./')

In [101]:
# read in training dataset
train_df = pd.read_csv(path/'data/train_v4.csv', index_col=0, dtype={'season':str})

These are the fields in the base dataset, all sourced from FPL and Transfermarkt:

In [102]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90437 entries, 0 to 90436
Data columns (total 37 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   player                                      90437 non-null  object 
 1   gw                                          90437 non-null  int64  
 2   position                                    90437 non-null  int64  
 3   minutes                                     90437 non-null  int64  
 4   team                                        90437 non-null  object 
 5   opponent_team                               90437 non-null  object 
 6   relative_market_value_team                  22501 non-null  float64
 7   relative_market_value_opponent_team         22501 non-null  float64
 8   was_home                                    90437 non-null  bool   
 9   total_points                                90437 non-null  int64  
 10  assists   

Each row represents one player's performance in a single fixture, and is uniquely identified by the combination of the player name and kickoff time fields:

- player (player name)
- kickoff_time (kickoff time for the fixture)
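The uniqueness claim above is easy to verify. The sketch below uses a small toy DataFrame (`toy_df`, with illustrative rows rather than the real `train_df`) and checks that no two rows share the same `player` and `kickoff_time`; the same two lines could be run against `train_df` directly.

```python
import pandas as pd

# Toy rows standing in for train_df (illustrative values only)
toy_df = pd.DataFrame({
    "player": ["Aaron_Ramsey", "Aaron_Ramsey", "Erik_Pieters"],
    "kickoff_time": [
        "2016-08-14T15:00:00Z",
        "2016-08-20T14:00:00Z",
        "2020-07-26T15:00:00Z",
    ],
    "total_points": [2, 1, 3],
})

key = ["player", "kickoff_time"]

# keep=False flags every row that shares its key with at least one other row
duplicate_rows = toy_df[toy_df.duplicated(subset=key, keep=False)]

# True when the key identifies each row uniquely
is_unique = duplicate_rows.empty
print(is_unique)
```

If `is_unique` comes back `False` on the real data, `duplicate_rows` shows exactly which player/fixture pairs are repeated.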

The fixtures are further defined by the following fields:

- team (the player's team)
- opponent_team (the opposition team)
- was_home (was it a home game for the player)
- season (e.g. '1920' for the 2019/20 season)
- gw (the FPL gameweek in which the fixture occurred)

Note that there can be multiple fixtures (i.e. rows for a given player) in a single gameweek.
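One way to find these multi-fixture gameweeks is to count rows per player, season and gameweek. This is a minimal sketch on a toy DataFrame; the player names and the `n_fixtures` column name are made up for illustration.

```python
import pandas as pd

# Toy rows: player "A" has two fixtures in gameweek 2 (illustrative values only)
toy_df = pd.DataFrame({
    "player": ["A", "A", "A", "B"],
    "season": ["1920", "1920", "1920", "1920"],
    "gw": [1, 2, 2, 1],
    "kickoff_time": [
        "2019-08-10T14:00:00Z",
        "2019-08-17T14:00:00Z",
        "2019-08-20T19:00:00Z",
        "2019-08-10T14:00:00Z",
    ],
})

# Count fixtures (rows) per player in each gameweek of each season
fixtures_per_gw = (
    toy_df.groupby(["player", "season", "gw"])
    .size()
    .rename("n_fixtures")  # hypothetical column name
    .reset_index()
)

# Keep only the gameweeks where a player has more than one fixture
multi_fixture_gws = fixtures_per_gw[fixtures_per_gw["n_fixtures"] > 1]
print(multi_fixture_gws)
```

This matters when aggregating to gameweek level: summing `total_points` per gameweek is appropriate, whereas taking the first row would silently drop a fixture.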

The position that a player plays is also given. This will be consistent for each player within a season, but may change between seasons:

- position (1 - goalkeeper, 2 - defender, 3 - midfielder, 4 - forward)
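The position codes can be mapped to readable labels, and the within-season consistency can be checked by counting distinct positions per player and season. The sketch below assumes a toy DataFrame; `POSITION_NAMES` and the `position_name` column are hypothetical names introduced here.

```python
import pandas as pd

# Mapping taken from the field description above
POSITION_NAMES = {1: "goalkeeper", 2: "defender", 3: "midfielder", 4: "forward"}

# Toy rows: player "A" changes position between seasons, not within one
toy_df = pd.DataFrame({
    "player": ["A", "A", "A", "B"],
    "season": ["1819", "1819", "1920", "1920"],
    "position": [3, 3, 4, 1],
})

# Human-readable label alongside the integer code
toy_df["position_name"] = toy_df["position"].map(POSITION_NAMES)

# Each (player, season) pair should have exactly one distinct position
positions_per_season = toy_df.groupby(["player", "season"])["position"].nunique()
consistent_within_season = bool((positions_per_season == 1).all())
print(consistent_within_season)
```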

Most of the other fields describe the player's (or team's) performance in the fixture, e.g. the number of minutes played, points scored, assists, goals, goals conceded while on the field, etc.

All of the fields above should be 100% complete for all rows.

Incomplete fields for FPL data are:

- transfer values (transfers_in, transfers_out, transfers_balance) - these were only collected from the start of the 2019/20 season, and require further investigation as to what they actually represent (in other words, treat with caution when modelling); values prior to the 2019/20 season are set to 0
- play_proba - again only collected from the start of the 2019/20 season, this is the probability that the player would actually be available for the fixture according to the FPL website (note that the time at which this is captured each week varies); values prior to the 2019/20 season are null, and they are also null for any new players in a given gameweek (i.e. players that FPL has added to the game during that gameweek)
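Because the pre-2019/20 transfer values are placeholder zeros rather than real observations, one option when modelling is to convert them to missing values so they are not mistaken for genuine zero transfers. The sketch below is one possible treatment, not something the dataset itself prescribes. It assumes the `season` string starts with the two-digit start year (e.g. '1920'), and the `play_proba_missing` flag is a hypothetical helper column.

```python
import numpy as np
import pandas as pd

# Toy rows: one pre-2019/20 row and two 2019/20 rows (illustrative values only)
toy_df = pd.DataFrame({
    "season": ["1617", "1920", "1920"],
    "transfers_in": [0, 22, 270],
    "transfers_out": [0, 24, 200],
    "transfers_balance": [0, -2, 70],
    "play_proba": [np.nan, 1.0, np.nan],
})

# Assumption: the first two characters of season are the start year
start_year = toy_df["season"].str[:2].astype(int)
collected = start_year >= 19  # transfer data collected from 2019/20 onwards

transfer_cols = ["transfers_in", "transfers_out", "transfers_balance"]

# Cast to float so the columns can hold NaN, then blank out the placeholder zeros
toy_df[transfer_cols] = toy_df[transfer_cols].astype("float64")
toy_df.loc[~collected, transfer_cols] = np.nan

# Flag rows where play_proba was not available (older seasons or newly added players)
toy_df["play_proba_missing"] = toy_df["play_proba"].isna()
print(toy_df)
```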

Finally, team market value is taken from Transfermarkt, either each week (for the 2019/20 season) or as a single value for the whole season:

- relative_market_value_team - the market value for the team scraped during that gameweek (non null from start of 2019/20 season)
- relative_market_value_opponent_team - the market value for the opposition team scraped during that gameweek (non null from start of 2019/20 season)
- relative_market_value_team_season - a single value for the team's value at the start of each season
- relative_market_value_opponent_team_season - a single value for the opposition team's value at the start of each season
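Since the weekly values are null before 2019/20 while the season-level values are always present, one possible way to get a fully populated feature is to fall back to the season value wherever the weekly value is missing. This is a sketch of that idea only; `team_value` and `opponent_value` are hypothetical column names, and the opposition season column is named `relative_market_value_opponent_team_season` as it appears in the DataFrame output.

```python
import numpy as np
import pandas as pd

# Toy rows: a 2016/17 row with no weekly values and a 2019/20 row with them
toy_df = pd.DataFrame({
    "relative_market_value_team": [np.nan, 2.430397],
    "relative_market_value_team_season": [0.895471, 2.727025],
    "relative_market_value_opponent_team": [np.nan, 0.327574],
    "relative_market_value_opponent_team_season": [2.243698, 0.198300],
})

# Prefer the weekly value; fall back to the season value where it is null
toy_df["team_value"] = toy_df["relative_market_value_team"].fillna(
    toy_df["relative_market_value_team_season"]
)
toy_df["opponent_value"] = toy_df["relative_market_value_opponent_team"].fillna(
    toy_df["relative_market_value_opponent_team_season"]
)
print(toy_df[["team_value", "opponent_value"]])
```

Note that this mixes two measurement frequencies in one column, so it may be worth comparing against a model that uses the season-level fields alone.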




In [112]:
train_df

Unnamed: 0,player,gw,position,minutes,team,opponent_team,relative_market_value_team,relative_market_value_opponent_team,was_home,total_points,...,threat,transfers_balance,transfers_in,transfers_out,yellow_cards,kickoff_time,season,play_proba,relative_market_value_team_season,relative_market_value_opponent_team_season
0,Aaron_Cresswell,1,2,0,West Ham United,Chelsea,,,False,0,...,0.0,0,0,0,0,2016-08-15T19:00:00Z,1617,,0.895471,2.243698
1,Aaron_Lennon,1,3,15,Everton,Tottenham Hotspur,,,True,1,...,0.0,0,0,0,0,2016-08-13T14:00:00Z,1617,,1.057509,1.433690
2,Aaron_Ramsey,1,3,60,Arsenal,Liverpool,,,True,2,...,23.0,0,0,0,0,2016-08-14T15:00:00Z,1617,,1.944129,1.465860
3,Abdoulaye_Doucouré,1,3,0,Watford,Southampton,,,False,0,...,0.0,0,0,0,0,2016-08-13T14:00:00Z,1617,,0.704200,0.796805
4,Abdul Rahman_Baba,1,2,0,Chelsea,West Ham United,,,True,0,...,0.0,0,0,0,0,2016-08-15T19:00:00Z,1617,,2.243698,0.895471
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90432,Tommy_Doyle,38,3,0,Manchester City,Norwich,2.430397,0.327574,True,0,...,0.0,-2,22,24,0,2020-07-26T15:00:00Z,1920,1.0,2.727025,0.198300
90433,Joseph_Anang,38,1,0,West Ham United,Aston Villa,0.709989,0.553818,True,0,...,0.0,70,270,200,0,2020-07-26T15:00:00Z,1920,1.0,0.739196,0.338194
90434,Erik_Pieters,38,2,90,Burnley,Brighton and Hove Albion,0.370648,0.541184,True,3,...,2.0,139816,144388,4572,1,2020-07-26T15:00:00Z,1920,1.0,0.441799,0.476156
90435,Japhet_Tanganga,38,2,0,Tottenham Hotspur,Crystal Palace,1.604904,0.430493,False,0,...,0.0,7999,14840,6841,0,2020-07-26T15:00:00Z,1920,1.0,2.113981,0.495374
