# Data Processing

Here we build the pipeline that performs the feature transformations required before model fitting. The steps are listed below.

* Remove observations where 'Minutes Played' == 0
* Include 'Venue' as an important feature
* Include 'Minutes Played' as an important feature
* One-hot encode 'Position', then group the positions into 'Defender', 'Midfielder' and 'Attacker'
* Create a 'Season' feature from 'kickoff_time', as it is needed to calculate 'Designated Penalty Taker'
* Use 'Penalties Attempted' to calculate 'Designated Penalty Taker'
* Include 'Shots on Target' as an important feature
* Include 'npxG' as an important feature
* Include 'Penalty Area Touches' as an important feature
* Compute 'Rolling xG' to replace 'npxG', then drop 'npxG'
* Calculate 'Team Rolling xG Matchup': use the team data to calculate 'Team Rolling xG' and 'Team Rolling xGA', then calculate 'Team Rolling xG Difference', which in turn gives 'Team Rolling xG Matchup'
* Calculate 'Rolling Shots on Target' to replace 'Shots on Target', then drop 'Shots on Target'
* Calculate 'Rolling Penalty Area Touches' to replace 'Penalty Area Touches', then drop 'Penalty Area Touches'

This should leave the final dataframe with the following features: Venue, Minutes Played, Defender, Midfielder, Attacker, Designated Penalty Taker, Rolling xG, Rolling Shots on Target, Rolling Penalty Area Touches, Team Rolling xG and Team Rolling xG Matchup, alongside the target, Goals.
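
To make the 'Season' and 'Designated Penalty Taker' steps in the list above concrete, here is a minimal sketch on made-up data. The helper names `season_label` and `designated_penalty_takers`, the `threshold` parameter and the toy rows are assumptions for illustration only. A season is taken to run from August to July, and a player is flagged when they took more than half of their team's penalties, pooled over the team-seasons in which they took at least one.

```python
import pandas as pd

def season_label(kickoff):
    """Label each timestamp with a 'YYYY-YYYY' season that starts in August."""
    start_year = kickoff.dt.year.where(kickoff.dt.month >= 8, kickoff.dt.year - 1)
    return start_year.astype(str) + "-" + (start_year + 1).astype(str)

def designated_penalty_takers(df, threshold=0.5):
    """Return the set of Player IDs whose share of team penalties exceeds `threshold`."""
    keys = ["Season", "Team"]
    # Penalties attempted by each team in each season
    team_total = df.groupby(keys)["Penalties Attempted"].sum().rename("Team Penalties")
    # Penalties attempted by each player within each team-season
    player_total = df.groupby(keys + ["Player ID"], as_index=False)["Penalties Attempted"].sum()
    player_total = player_total[player_total["Penalties Attempted"] > 0]
    player_total = player_total.join(team_total, on=keys)
    # Pool each player's team-seasons, then take the overall share
    pooled = player_total.groupby("Player ID")[["Penalties Attempted", "Team Penalties"]].sum()
    share = pooled["Penalties Attempted"] / pooled["Team Penalties"]
    return set(share[share > threshold].index)

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Player ID": [1, 2, 1, 3],
    "Team": ["A", "A", "A", "B"],
    "kickoff_time": pd.to_datetime(["2023-09-01", "2023-09-01", "2024-02-01", "2024-08-10"]),
    "Penalties Attempted": [2, 1, 1, 0],
})
toy["Season"] = season_label(toy["kickoff_time"])
takers = designated_penalty_takers(toy)
toy["Designated Penalty Taker"] = toy["Player ID"].isin(takers).astype(int)
```

In this toy example player 1 took 3 of team A's 4 penalties in 2023-2024, so only player 1 is flagged.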


In [1]:
#import necessary packages
import pandas as pd 
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline


In [2]:
#import relevant dataframes from source
att_train = pd.read_csv('att_explore_original.csv', index_col = 0)
att_test = pd.read_csv('att_test.csv', index_col = 0)

In [19]:
#function to select specific columns
def select_columns(dataframe):
    columns = ['Player ID', 'Team', 'Opponent', 'Venue', 'Goals', 'Minutes Played', 'Position', 'kickoff_time', 'Penalties Attempted', 
               'Shots on Target', 'npxG', 'Penalty Area Touches'] 
    return dataframe[columns].copy()

#transformer to drop rows with empty 'Position'
class DropEmptyPositions(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[X['Position'] != '0']

#transformer for one-hot encoding and creating position categories
class PositionEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        #one hot encode positions
        positions_encode = X['Position'].str.get_dummies(sep=',')
        
        #create new position categories
        positions_encode['Defender'] = positions_encode[['RB', 'LB', 'CB']].any(axis=1).astype(int)
        positions_encode['Midfielder'] = positions_encode[['DM', 'CM', 'LM', 'RM', 'AM']].any(axis=1).astype(int)
        positions_encode['Attacker'] = positions_encode[['LW', 'RW', 'FW']].any(axis=1).astype(int)
        
        # Drop original position columns
        positions_encode = positions_encode.drop(columns=['RB', 'LB', 'CB', 'DM', 'CM', 'LM', 'RM', 'LW', 'RW', 'AM', 'FW', 'WB'], 
                                                 errors='ignore')
        
        X = X.drop('Position', axis = 1)
        
        return pd.concat([X.reset_index(drop=True), positions_encode.reset_index(drop=True)], axis=1)

#transformer for determining the season
class SeasonDeterminer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Convert 'kickoff_time' to datetime if not already done
        X['kickoff_time'] = pd.to_datetime(X['kickoff_time'])
        
        # Function to determine the season
        def determine_season(kickoff_time):
            month = kickoff_time.month
            year = kickoff_time.year
            if month >= 8:  # August to December
                return f'{year}-{year + 1}'  # Current year to next year
            else:  # January to July
                return f'{year - 1}-{year}'  # Previous year to current year
        
        # Apply the function to create the 'Season' column
        X['Season'] = X['kickoff_time'].apply(determine_season)
        return X
    
    
#transformer for calculating 'Designated Penalty Taker'
class DesigPenTaker(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self 
    def transform(self, X):
        #group observations by player ID and penalties attempted 
        pen_group = X.groupby('Player ID', as_index = False)['Penalties Attempted'].sum()

        #remove obs with 0 penalties attempted 
        pen_group = pen_group[pen_group['Penalties Attempted'] > 0 ]

        #create new dataframe which has 'kickoff_time', 'Season', 'Penalties Attempted' and 'Team' in it
        team_pens = X[['kickoff_time', 'Season', 'Penalties Attempted', 'Team']].copy()

        #now we group by team and season to compute how many penalties were taken by each team in each season
        team_pens_summary = team_pens.groupby(['Season', 'Team'], as_index=False)['Penalties Attempted'].sum()
        team_pens_summary.rename(columns={'Penalties Attempted': 'Team Penalties'}, inplace=True)

        #create empty dataframe
        pen_prop = pd.DataFrame()

        #loop through to get the Player ID and Penalties Attempted for each team in each season, filtering so that we only 
        # include observations with at least 1 penalty taken 
        for index, row in team_pens_summary.iterrows():
            team = row['Team']
            season = row['Season']
    
            filtered = X[(X['Season'] == season) & (X['Team'] == team) & (X['Penalties Attempted'] > 0)][['Player ID', 'Penalties Attempted']]
            filtered['Team'] = team
            filtered['Season'] = season
            pen_prop = pd.concat([pen_prop, filtered], ignore_index= True)

        #adding a new column into pen_prop called 'Team Penalties' which merges the relevant information from team_pens_summary
        pen_prop = pen_prop.merge(team_pens_summary, on=['Team', 'Season'], how='left')

        #we now merge rows that have the same player ID, team and season together. For the rows that satisfy this, we sum the 
        # penalties attempted to reflect the number of penalties a particular player ID took in a given season 
        merged_penprop = pen_prop.groupby(['Team', 'Season', 'Player ID'], as_index=False).agg({
            'Penalties Attempted': 'sum',
            'Team Penalties': 'first'  
        })
        merged_penprop = merged_penprop.sort_values(by='Player ID')

        #adding new column called Proportion of Team Penalties Taken
        merged_penprop['Proportion of Team Penalties Taken'] = (merged_penprop['Penalties Attempted'] / merged_penprop['Team Penalties'])

        #final dataframe which merges the rows based on Player ID. Each row now corresponds to one unique player ID, the penalties
        # attempted and team penalties columns are now summed. The proportion is then recalculated 
        penprop_summary = merged_penprop.groupby('Player ID').agg(
            Penalties_Attempted=('Penalties Attempted', 'sum'),
            Team_Penalties=('Team Penalties', 'sum')
        ).reset_index()

        penprop_summary['Proportion of Team Penalties Taken'] = (
        penprop_summary['Penalties_Attempted'] / penprop_summary['Team_Penalties'])

        #treat any player who took more than 50% of their team's penalties as a 'designated penalty taker'; this also
        # covers players who took 100% of them
        desig_pen_takers = penprop_summary.loc[penprop_summary['Proportion of Team Penalties Taken'] > 0.5, 'Player ID'].tolist()

        #construct 'Designated Penalty Taker' feature 
        X['Designated Penalty Taker'] = X['Player ID'].isin(desig_pen_takers).astype(int)
        
        #remove 'Penalties Attempted' column
        X = X.drop('Penalties Attempted', axis = 1)
    
        return X


#transformer for calculating 'Rolling xG'
class RollingxG(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self 
    def transform(self, X):
        #sort values by Player ID and kickoff_time, we also need to reset the index to ensure that the shifting in the function 
        # below works as intended
        X.sort_values(by=['Player ID', 'kickoff_time'], inplace=True)
        X.reset_index(drop=True, inplace=True)

        #function to calculate rolling xG (past 365 days version)
        def calculate_rolling_xg(row, df):
            player_id = row['Player ID']
            kickoff_time = row['kickoff_time']
    
            # Define the date range
            start_date = kickoff_time - pd.Timedelta(days=365)
            end_date = kickoff_time - pd.Timedelta(days=1)  # exclusive of the kickoff_time
    
            # Filter the DataFrame for the specific player and date range
            player_data = df[(df['Player ID'] == player_id) & 
                            (df['kickoff_time'] >= start_date) & 
                            (df['kickoff_time'] <= end_date)]
    
            # Calculate total xG and number of games
            total_xG = player_data['npxG'].sum()
            number_of_games = player_data.shape[0]
    
            # Calculate average xG per game
            if number_of_games > 0:
                return total_xG / number_of_games
            else:
                return None  # No games played in the period
    

        # Apply the function to create the 'rolling xG' column
        X.loc[:, 'Rolling xG'] = X.apply(lambda row: calculate_rolling_xg(row, X), axis=1)
        
        #drop 'npxG' column
        X = X.drop('npxG', axis = 1)
        
        return X


#transformer for calculating 'Rolling Team xG Matchup'
class RollingxG_Matchup(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self 
    
    def transform(self, X):
        #load in team data file 
        team_finaldat = pd.read_csv('team_finaldat.csv', index_col= 0)

        #drop irrelevant columns
        team_finaldat = team_finaldat.drop(columns = ['Referee', 'Attendance', 'Formation', 'Opposition Formation'])
        
        #we can see that the team data is first grouped by 'Team', but then 'Date' is backwards. Let's amend this. 
        team_finaldat = team_finaldat.sort_values(by=['Team', 'Date'], ascending=[True, True])
        
        #determine_season function, which converts kickoff_time into 'Season'
        def determine_season(kickoff_time):
            month = kickoff_time.month
            year = kickoff_time.year
            if month >= 8:  # August to December
                return f'{year}-{year + 1}'  # Current year to next year
            else:  # January to July
                return f'{year - 1}-{year}'  # Previous year to current year
        
        
        #we now want to add in rolling xG and xGA for teams. We can try reuse the functions used previously, to do this we need 
        # to add the 'Season' feature to the team_finaldat dataframe
        team_finaldat['Date'] = pd.to_datetime(team_finaldat['Date'])
        team_finaldat['Season'] = team_finaldat['Date'].apply(determine_season)

        #sort values by Team and Date, we also need to reset the index to ensure that the shifting in the function below works as intended
        team_finaldat.sort_values(by=['Team', 'Date'], inplace=True)
        team_finaldat.reset_index(drop=True, inplace=True)

        #function to calculate rolling xG
        def calculate_rolling_teamxg(group):
            # Calculate the cumulative sum and the number of games played
            cumulative_sum = group['xG'].cumsum()
            count = pd.Series(range(1, len(group) + 1), index=group.index)
    
            # Create a new Series for rolling xG
            rolling_xg = cumulative_sum.shift(1)/count.shift(1)
    
            return rolling_xg

        #function to calculate rolling xGA
        def calculate_rolling_teamxga(group):
            # Calculate the cumulative sum and the number of games played
            cumulative_sum = group['xGA'].cumsum()
            count = pd.Series(range(1, len(group) + 1), index=group.index)
    
            # Create a new Series for rolling xGA
            rolling_xg = cumulative_sum.shift(1)/count.shift(1)
    
            return rolling_xg

        #apply function to get rolling xG and xGA for each team 
        team_finaldat['Team Rolling xG'] = team_finaldat.groupby(['Team', 'Season']).apply(calculate_rolling_teamxg, 
                                                                                include_groups = False).reset_index(drop = True)
        team_finaldat['Team Rolling xGA'] = team_finaldat.groupby(['Team', 'Season']).apply(calculate_rolling_teamxga, 
                                                                                include_groups = False).reset_index(drop = True)
        
        #create xg/xga diff feature 
        team_finaldat['Team xG Difference'] = team_finaldat['xG'] - team_finaldat['xGA']
        
        #function to calculate rolling xg diff
        def calculate_rolling_teamxgdiff(group):
            # Calculate the cumulative sum and the number of games played
            cumulative_sum = group['Team xG Difference'].cumsum()
            count = pd.Series(range(1, len(group) + 1), index=group.index)
    
            # Create a new Series for rolling xG difference
            rolling_xg = cumulative_sum.shift(1)/count.shift(1)
    
            return rolling_xg

        team_finaldat['Team Rolling xG Difference'] = team_finaldat.groupby(['Team', 'Season']).apply(calculate_rolling_teamxgdiff, 
                                                                                include_groups = False).reset_index(drop = True)
        
        #merge original dataframe with team data
        merged_df = X.merge(
            team_finaldat[['Season', 'Venue', 'Team', 'Opponent', 'Team Rolling xG', 'Team Rolling xGA', 'Team Rolling xG Difference']],
            on=['Season', 'Venue', 'Team', 'Opponent'],
            how='left'  
        )

        X['Team Rolling xG'] = merged_df['Team Rolling xG']
        X['Team Rolling xGA'] = merged_df['Team Rolling xGA']
        X['Team Rolling xG Difference'] = merged_df['Team Rolling xG Difference']
        
        #now, calculate rolling xG team matchups
        #create the new feature, 'Team Rolling xG Matchup'
        X['Team Rolling xG Matchup'] = None

        #group by 'Team', 'Opponent', 'Season', and 'Venue'
        for (team, opponent, season, venue), group in X.groupby(['Team', 'Opponent', 'Season', 'Venue']):
            #initialize none values
            team_xgdiff = None
            opp_xgdiff = None

            #get team xg diff
            if group['Team Rolling xG Difference'].nunique() == 1:
                team_xgdiff = group['Team Rolling xG Difference'].iloc[0]

            #get opponent rows
            opponent_row = X[
                (X['Team'] == opponent) &
                (X['Opponent'] == team) &
                (X['Season'] == season) &
                (X['Venue'] != venue)  # Ensure venue is opposite
            ].reset_index(drop=True)

            #make sure opponent rows have the same team rolling xG diff, if so select any
            if opponent_row['Team Rolling xG Difference'].nunique() == 1:
                opp_xgdiff = opponent_row['Team Rolling xG Difference'].iloc[0]
            #calculate xg matchup val
            if team_xgdiff is not None and opp_xgdiff is not None:
                xg_matchup = team_xgdiff - opp_xgdiff
            else:
                xg_matchup = None  # handle cases where values are not available

            #assign the calculated value back to the original DataFrame
            X.loc[group.index, 'Team Rolling xG Matchup'] = xg_matchup
        
        #drop the unnecessary columns
        X = X.drop(['Team Rolling xGA', 'Team Rolling xG Difference'], axis = 1)
            
        return X
    
    
#transformer for calculating 'Rolling Shots on Target'
class RollingSOT(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self 
    def transform(self, X):
        #sort values by Player ID and kickoff_time, we also need to reset the index to ensure that the shifting in the function 
        # below works as intended
        X.sort_values(by=['Player ID', 'kickoff_time'], inplace=True)
        X.reset_index(drop=True, inplace=True)

        #function to calculate rolling shots on target (past 365 days version)
        def calculate_rolling_sot(row, df):
            player_id = row['Player ID']
            kickoff_time = row['kickoff_time']

            # Define the date range
            start_date = kickoff_time - pd.Timedelta(days=365)
            end_date = kickoff_time - pd.Timedelta(days=1)  # exclusive of the kickoff_time

            # Filter the DataFrame for the specific player and date range
            player_data = df[(df['Player ID'] == player_id) & 
                            (df['kickoff_time'] >= start_date) & 
                            (df['kickoff_time'] <= end_date)]

            # Calculate total shots on target and number of games
            total_sot = player_data['Shots on Target'].sum()
            number_of_games = player_data.shape[0]

            # Calculate average shots on target per game
            if number_of_games > 0:
                return total_sot / number_of_games
            else:
                return None  # No games played in the period


        # Apply the function to create the 'Rolling Shots on Target' column
        X.loc[:, 'Rolling Shots on Target'] = X.apply(lambda row: calculate_rolling_sot(row, X), axis=1)
        
        #drop 'Shots on Target' column
        X = X.drop('Shots on Target', axis = 1)
        
        return X
    
#transformer for calculating 'Rolling Penalty Area Touches'
class RollingPAT(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self 
    def transform(self, X):
        #sort values by Player ID and kickoff_time, we also need to reset the index to ensure that the shifting in the function 
        # below works as intended
        X.sort_values(by=['Player ID', 'kickoff_time'], inplace=True)
        X.reset_index(drop=True, inplace=True)

        #function to calculate rolling penalty area touches (past 365 days version)
        def calculate_rolling_pat(row, df):
            player_id = row['Player ID']
            kickoff_time = row['kickoff_time']

            # Define the date range
            start_date = kickoff_time - pd.Timedelta(days=365)
            end_date = kickoff_time - pd.Timedelta(days=1)  # exclusive of the kickoff_time

            # Filter the DataFrame for the specific player and date range
            player_data = df[(df['Player ID'] == player_id) & 
                            (df['kickoff_time'] >= start_date) & 
                            (df['kickoff_time'] <= end_date)]

            # Calculate total penalty area touches and number of games
            total_pat = player_data['Penalty Area Touches'].sum()
            number_of_games = player_data.shape[0]

            # Calculate average penalty area touches per game
            if number_of_games > 0:
                return total_pat / number_of_games
            else:
                return None  # No games played in the period


        # Apply the function to create the 'Rolling Penalty Area Touches' column
        X.loc[:, 'Rolling Penalty Area Touches'] = X.apply(lambda row: calculate_rolling_pat(row, X), axis=1)
        
        #drop 'Penalty Area Touches' column
        X = X.drop('Penalty Area Touches', axis = 1)
        
        return X

#transformer that converts 'Home' to 0 and 'Away' to 1 in the 'Venue' column
class encodeVenue(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self 
    def transform(self, X):
        # Ensure 'Venue' column exists in the DataFrame
        if 'Venue' in X.columns:
            # Map 'Home' to 0 and 'Away' to 1; any other value becomes NaN and is dropped later by DropRow
            X['Venue'] = X['Venue'].map({'Home': 0, 'Away': 1})
        return X


#transformer that drops the columns in the dataframe that we needed for feature transformation, but we don't need for model fitting
class DropFinal(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self 
    def transform(self, X):
        X = X.drop(['Player ID', 'Team', 'Opponent', 'kickoff_time', 'Season'], axis = 1)
        return X
    

#transformer that drops the rows that have NaNs
class DropRow(BaseEstimator, TransformerMixin):
    def fit (self, X, y = None):
        return self 
    def transform(self, X):
        X = X.dropna()
        return X

        

#FunctionTransformer to select the necessary columns
select_transformer = FunctionTransformer(select_columns)

# Create a Pipeline
pipe = Pipeline(steps=[
    ('select', select_transformer), 
    ('drop_empty_pos', DropEmptyPositions()), 
    ('encode_positions', PositionEncoder()), 
    ('determine_season', SeasonDeterminer()), 
    ('desig_pen_taker', DesigPenTaker()), 
    ('rolling_xg', RollingxG()), 
    ('rolling_sot', RollingSOT()),
    ('rolling_pat', RollingPAT()),
    ('rolling_xg_matchup', RollingxG_Matchup()),
    ('encodeVenue', encodeVenue()), 
    ('drop_final', DropFinal()), 
    ('droprow', DropRow())
])
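
The three rolling transformers above scan the whole DataFrame once per row, which can become slow on larger datasets. As a rough alternative, the same idea can be sketched with pandas' time-based rolling windows. The helper name `trailing_mean` and the toy data are hypothetical, and the window here is `[t - 365 days, t)`, which differs slightly at the boundary from the `[t - 365 days, t - 1 day]` range used in the transformers.

```python
import pandas as pd

def trailing_mean(df, value_col, window="365D", id_col="Player ID", time_col="kickoff_time"):
    """Per-player mean of `value_col` over the trailing window, excluding the current row."""
    out = df.sort_values([id_col, time_col]).reset_index(drop=True)
    rolled = (
        out.set_index(time_col)
           .groupby(id_col)[value_col]
           .rolling(window, closed="left")  # closed="left" leaves out the current match
           .mean()
    )
    # Groups come back in sorted id order with rows in time order, matching `out` row for row
    out["Rolling " + value_col] = rolled.to_numpy()
    return out

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Player ID": [2, 1, 1, 2, 1, 2],
    "kickoff_time": pd.to_datetime(
        ["2023-01-01", "2023-01-01", "2023-02-01", "2023-01-08", "2024-03-01", "2023-01-15"]),
    "npxG": [1.0, 0.2, 0.4, 0.0, 0.6, 0.5],
})
result = trailing_mean(toy, "npxG")
```

A player's first match, and any match with no games in the preceding 365 days, gets NaN, which mirrors the `None` returned by the row-wise functions.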

In [20]:
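#use pipeline to transform the DataFrames. Every step is stateless (each fit just returns self), so calling
# fit_transform on the test set learns nothing from it; the rolling player features for the test set are
# computed from the test rows alone
att_train_processed = pipe.fit_transform(att_train)
att_test_processed = pipe.fit_transform(att_test)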



In [22]:
att_test_processed.head()

|  | Venue | Goals | Minutes Played | Defender | Midfielder | Attacker | Designated Penalty Taker | Rolling xG | Rolling Shots on Target | Rolling Penalty Area Touches | Team Rolling xG | Team Rolling xG Matchup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 72 | 0 | 0 | 1 | 0 | 0.8 | 1.0 | 2.0 | 1.604762 | 0.653209 |
| 2 | 1 | 1 | 72 | 0 | 0 | 1 | 0 | 0.65 | 1.5 | 2.5 | 1.803846 | 1.438462 |
| 3 | 1 | 0 | 25 | 0 | 0 | 1 | 0 | 0.766667 | 1.333333 | 2.666667 | 1.837037 | 0.37599 |
| 4 | 1 | 0 | 6 | 0 | 0 | 1 | 0 | 0.6 | 1.0 | 2.25 | 2.25 | 0.15 |
| 6 | 0 | 0 | 68 | 0 | 0 | 1 | 0 | 0.36 | 1.0 | 2.2 | 1.663636 | 0.413636 |


In [23]:
#split the above datasets into feature and target sets

att_train_x = att_train_processed.drop('Goals', axis = 1)
att_train_y = att_train_processed['Goals']

att_test_x = att_test_processed.drop('Goals', axis = 1)
att_test_y = att_test_processed['Goals']

# Model Training

We are now in a position to begin model training. We treat this as a regression problem: the goal is to predict the number of goals a particular player scores in a particular game as a continuous number (e.g. a given player might be predicted to score 0.3 goals in a given game). Accordingly, we use the RMSE (root mean squared error) as the score to evaluate and compare competing models.
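
As a small worked example of the metric, the sketch below computes RMSE by hand on made-up numbers. The `rmse` helper and the values are for illustration only and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

goals = [0, 1, 0, 2]              # made-up observed goals
predicted = [0.5, 0.5, 0.5, 1.0]  # made-up continuous predictions
score = rmse(goals, predicted)    # sqrt((0.25 + 0.25 + 0.25 + 1.0) / 4) = sqrt(0.4375)
```

Because the residuals are squared before averaging, the single miss of 1.0 goal contributes as much to the score as the other three misses combined.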

In [17]:
from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())

In [28]:
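#linear regression; the "neg_root_mean_squared_error" scorer returns the negated RMSE (higher is better), so the scores are negative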
from sklearn.linear_model import LinearRegression

linreg_classifier = LinearRegression()
linreg_scores = cross_val_score(linreg_classifier, att_train_x, att_train_y, cv = 10, scoring = "neg_root_mean_squared_error")
display_scores(linreg_scores)

Scores: [-0.38641969 -0.36030985 -0.38296061 -0.38685603 -0.39624085 -0.39368591
 -0.3746901  -0.44459941 -0.40338992 -0.39929472]
Mean: -0.39284470866088456
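
scikit-learn negates error metrics in its scorers so that a higher score is always better, which is why the values above are negative. A short sketch of turning them back into ordinary RMSE values, reusing the fold scores printed above (the names `linreg_rmse`, `mean_rmse` and `worst_fold` are just illustrative):

```python
import numpy as np

# Fold scores copied from the cross-validation output above
linreg_scores = np.array([
    -0.38641969, -0.36030985, -0.38296061, -0.38685603, -0.39624085,
    -0.39368591, -0.3746901, -0.44459941, -0.40338992, -0.39929472,
])

linreg_rmse = -linreg_scores               # flip the sign to recover RMSE per fold
mean_rmse = linreg_rmse.mean()             # average RMSE across the 10 folds
worst_fold = int(linreg_rmse.argmax())     # index of the fold with the largest error
```

Read this way, the linear regression is off by roughly 0.39 goals per player-game on average across the folds.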
