# Feature Engineering

As mentioned at the end of the previous notebook, we want to enrich our dataset, which currently contains only single-gameweek data, by adding time-based rolling features that give more insight into player form.

Before we begin, let's look at the features we currently have:

In [1]:
# imports
import pandas as pd

In [2]:
# current features
processed_24_25 = pd.read_csv('../data/processed/2024-25/processed_merged_gws.csv')
current_features = processed_24_25.columns

current_features

Index(['element', 'name', 'position', 'GW', 'total_points', 'value', 'minutes',
       'expected_goals', 'expected_assists', 'expected_goals_conceded',
       'goals_scored', 'assists', 'goals_conceded', 'clean_sheets',
       'ict_index', 'fixture', 'was_home', 'fixture_difficulty'],
      dtype='object')

Now, let's identify the features that we can engineer and enhance:
- **total_points**: the player's score for a single gameweek. We can enhance this feature by taking the player's total or average score over the previous few gameweeks.
- We can do the same for **minutes played**, **goals and assists**, **expected goals and assists**, **goals conceded**, **expected goals conceded**, and **clean sheets**.

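As an initial attempt, we will use a sliding window of **3 gameweeks**. Each feature is computed from the gameweeks before the current one, so a row's own gameweek statistics are excluded from its rolling features.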
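
To make the window concrete, here is a minimal sketch on made-up scores for a single player (the numbers are illustrative and not taken from the dataset). It assumes the same shift-then-roll pattern used in the helper function later in this notebook.

```python
import pandas as pd

# hypothetical gameweek scores for one player (GW1 to GW5)
points = pd.Series([2, 6, 9, 1, 5], name='total_points')

# shift(1) removes the current gameweek from its own window;
# min_periods=1 yields a value as soon as one past gameweek exists
avg_last_3 = points.shift(1).rolling(3, min_periods=1).mean()

print(avg_last_3.tolist())
```

GW1 has no history, so its value is NaN. GW2 uses only GW1, GW3 averages GW1 and GW2, and from GW4 onwards the full window applies (GW4 is the mean of 2, 6 and 9).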

## Features to be added

- **avg_score_last_3**: Average gameweek score over the last 3 gameweeks.

- **avg_mins_last_3**: Average minutes played over the last 3 gameweeks.

- **goals_last_3**: Total goals scored over the last 3 gameweeks.

- **assists_last_3**: Total assists over the last 3 gameweeks.

- **xG_last_3**: Total expected goals over the last 3 gameweeks.

- **xA_last_3**: Total expected assists over the last 3 gameweeks.

- **goals_conceded_last_3**: Total goals conceded over the last 3 gameweeks.

- **avg_xGC_last_3**: Average expected goals conceded over the last 3 gameweeks.

- **clean_sheets_last_3**: Total clean sheets over the last 3 gameweeks.

In [3]:
# helper function to calculate rolling features
def calculate_rolling_features(df, window=3, min_p=1):
    # new feature name -> (source column, aggregation over the window);
    # the '_last_3' suffix reflects the default window of 3 gameweeks
    rolling_specs = {
        'avg_score_last_3': ('total_points', 'mean'),
        'avg_mins_last_3': ('minutes', 'mean'),
        'goals_last_3': ('goals_scored', 'sum'),
        'assists_last_3': ('assists', 'sum'),
        'xG_last_3': ('expected_goals', 'sum'),
        'xA_last_3': ('expected_assists', 'sum'),
        'goals_conceded_last_3': ('goals_conceded', 'sum'),
        'avg_xGC_last_3': ('expected_goals_conceded', 'mean'),
        'clean_sheets_last_3': ('clean_sheets', 'sum'),
    }

    grouped = df.groupby('element')

    # roll statistics per player; shift(1) excludes the current gameweek,
    # and transform keeps the results aligned with the original row index
    for feature, (column, agg) in rolling_specs.items():
        df[feature] = grouped[column].transform(
            lambda x: x.shift(1).rolling(window, min_periods=min_p).agg(agg)
        )

    return df
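
The per-player grouping can be checked on a toy frame. The sketch below uses made-up data for two hypothetical players and a single feature, with `groupby(...).transform` returning values aligned to the original rows. It is an illustration of the pattern, not a replacement for the helper above.

```python
import pandas as pd

# toy data: two hypothetical players, already sorted by player and gameweek
toy = pd.DataFrame({
    'element': [1, 1, 1, 2, 2, 2],
    'GW': [1, 2, 3, 1, 2, 3],
    'goals_scored': [1, 0, 2, 0, 1, 0],
})

# same shift-then-roll pattern, applied within each player's rows only
toy['goals_last_3'] = toy.groupby('element')['goals_scored'].transform(
    lambda x: x.shift(1).rolling(3, min_periods=1).sum()
)

print(toy)
```

Because the shift happens inside each group, player 2's first gameweek is NaN rather than inheriting player 1's goals.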

In [4]:
# general function to facilitate the feature engineering process
def feature_engineering(df):
    # sort by player and GW to ensure proper rolling calculations
    df = df.sort_values(by=['element', 'GW'])

    # main rolling calculations
    fe_df = calculate_rolling_features(df)

    # drop rows with NaN values (e.g. each player's first gameweek, which has no history)
    fe_df = fe_df.dropna()
    
    return fe_df
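
To see what the sort and `dropna()` steps do to the row count, here is a small sketch on made-up, deliberately unsorted data with one rolling feature. The frame and its values are hypothetical.

```python
import pandas as pd

# toy data for two hypothetical players, rows deliberately out of order
toy = pd.DataFrame({
    'element': [2, 1, 1, 2, 1],
    'GW': [1, 2, 1, 2, 3],
    'total_points': [3, 6, 2, 8, 1],
})

# sort first so that shift(1) really refers to the previous gameweek
toy = toy.sort_values(by=['element', 'GW'])
toy['avg_score_last_3'] = toy.groupby('element')['total_points'].transform(
    lambda x: x.shift(1).rolling(3, min_periods=1).mean()
)

rows_before = len(toy)
toy = toy.dropna()

# number of rows removed by dropna()
print(rows_before - len(toy))
```

With `min_periods=1`, the only NaN rolling values are on each player's first row, so one row per player is removed here. Note that `dropna()` without arguments also removes rows that have a missing value in any other column.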

## Putting everything together and enhancing our data
### 2022-23

In [5]:
# load in processed data 
processed_merged_gws_22_23 = pd.read_csv('../data/processed/2022-23/processed_merged_gws.csv')

# transform
feature_engineered_22_23 = feature_engineering(processed_merged_gws_22_23)

# output
feature_engineered_22_23.to_csv('../data/processed/2022-23/feature_engineered_22_23.csv', index=False)

### 2023-24

In [6]:
# load in processed data 
processed_merged_gws_23_24 = pd.read_csv('../data/processed/2023-24/processed_merged_gws.csv')

# transform
feature_engineered_23_24 = feature_engineering(processed_merged_gws_23_24)

# output
feature_engineered_23_24.to_csv('../data/processed/2023-24/feature_engineered_23_24.csv', index=False)

### 2024-25

In [7]:
# load in processed data 
processed_merged_gws_24_25 = pd.read_csv('../data/processed/2024-25/processed_merged_gws.csv')

# transform
feature_engineered_24_25 = feature_engineering(processed_merged_gws_24_25)

# output
feature_engineered_24_25.to_csv('../data/processed/2024-25/feature_engineered_24_25.csv', index=False)