# Data Processing

We now know what shape we want our data in. Each row should contain:
- player ID (as in ```player_idlist.csv```)
- player name (full name to avoid ambiguities)
- position (GK/DEF/MID/FWD)
- gameweek 
- actual FPL points scored during gameweek
- player value (price)
- minutes played
- expected goals
- expected assists
- expected goals conceded
- goals scored
- assists
- goals conceded
- clean sheets
- ICT index
- fixture difficulty

**We want one big CSV file PER SEASON.**
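
The target shape above can also be written down as a mapping from descriptive field names to raw column names, which makes it easy to check a raw file before processing it. This is a minimal sketch: the raw column names are the ones selected later in this notebook, while `TARGET_COLUMNS`, `HELPER_COLUMNS` and `missing_columns` are hypothetical names introduced only for illustration. Fixture difficulty is left out of the mapping because it is derived from `fixture` and `was_home`.

```python
import pandas as pd

# Descriptive field -> column name in the raw merged_gw.csv files.
TARGET_COLUMNS = {
    'player ID': 'element',
    'player name': 'name',
    'position': 'position',
    'gameweek': 'GW',
    'FPL points': 'total_points',
    'player value': 'value',
    'minutes played': 'minutes',
    'expected goals': 'expected_goals',
    'expected assists': 'expected_assists',
    'expected goals conceded': 'expected_goals_conceded',
    'goals scored': 'goals_scored',
    'assists': 'assists',
    'goals conceded': 'goals_conceded',
    'clean sheets': 'clean_sheets',
    'ICT index': 'ict_index',
}

# Helper columns needed only to look up fixture difficulty.
HELPER_COLUMNS = ['fixture', 'was_home']


def missing_columns(df: pd.DataFrame) -> list:
    """Return the required raw columns that are absent from df."""
    required = list(TARGET_COLUMNS.values()) + HELPER_COLUMNS
    return [col for col in required if col not in df.columns]


# Toy (empty) frame that lacks the expected-goals columns.
toy = pd.DataFrame(columns=['element', 'name', 'position', 'GW', 'total_points',
                            'value', 'minutes', 'goals_scored', 'assists',
                            'goals_conceded', 'clean_sheets', 'ict_index',
                            'fixture', 'was_home'])
print(missing_columns(toy))
```

Running a check like this on each season's raw file before subsetting would surface a missing column with a clear message rather than a `KeyError` in the middle of the selection step.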

...let's get to work!

In [1]:
# imports
import pandas as pd

## 2022-23

In [2]:
# load in raw data
merged_gws_22_23 = pd.read_csv('../data/raw/2022-23/gws/merged_gw.csv')

# omit all data prior to gameweek 16 (as there was no expected goal data before that)
merged_gws_22_23 = merged_gws_22_23[merged_gws_22_23['GW'] > 15]

# pick out useful subset of data
merged_gws_22_23 = merged_gws_22_23[
    [
        'element',      # player ID
        'name',
        'position',
        'GW',
        'total_points',
        'value',
        'minutes',
        'expected_goals',
        'expected_assists',
        'expected_goals_conceded',
        'goals_scored',
        'assists',
        'goals_conceded',
        'clean_sheets',
        'ict_index',
        'fixture',
        'was_home'      # we will use 'fixture' and 'was_home' to retrieve fixture difficulty
    ]
]

In [3]:
# look up fixture difficulty using the 'fixture' and 'was_home' columns
fixtures_22_23 = pd.read_csv('../data/raw/2022-23/fixtures.csv')
# map each fixture ID to its home and away difficulty ratings
fixtures_dict_22_23 = fixtures_22_23[['id', 'team_h_difficulty', 'team_a_difficulty']].set_index('id').to_dict(orient='index')

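def add_fixture_difficulty(row, fixtures_dict):
    """Add a 'fixture_difficulty' entry to a single player-gameweek row."""
    # home players get the home-side rating, away players the away-side rating
    team = 'team_h_difficulty' if row['was_home'] else 'team_a_difficulty'
    row['fixture_difficulty'] = fixtures_dict[row['fixture']][team]
    return row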

merged_gws_22_23 = merged_gws_22_23.apply(lambda row: add_fixture_difficulty(row=row, fixtures_dict=fixtures_dict_22_23), axis=1)

# save processed data
merged_gws_22_23.to_csv('../data/processed/2022-23/processed_merged_gws.csv', index=False)
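
The row-wise `apply` above calls a Python function once per row, which can be slow on a full season of data. As an alternative, here is a hedged sketch of a vectorised lookup on toy data. The toy frames and the `lookup`, `home_difficulty` and `away_difficulty` names are illustrative assumptions, not part of the pipeline above. One behavioural difference: an unknown fixture ID yields `NaN` here instead of raising a `KeyError`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for fixtures.csv and the per-gameweek player data.
fixtures = pd.DataFrame({
    'id': [1, 2],
    'team_h_difficulty': [2, 4],
    'team_a_difficulty': [3, 5],
})
players = pd.DataFrame({
    'element': [10, 11, 12],
    'fixture': [1, 1, 2],
    'was_home': [True, False, True],
})

# Index by fixture ID so each difficulty column can act as a lookup table.
lookup = fixtures.set_index('id')
home_difficulty = players['fixture'].map(lookup['team_h_difficulty'])
away_difficulty = players['fixture'].map(lookup['team_a_difficulty'])

# Pick the home or away rating for every row in a single vectorised step.
players['fixture_difficulty'] = np.where(
    players['was_home'], home_difficulty, away_difficulty
)
print(players['fixture_difficulty'].tolist())  # [2, 3, 4]
```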

## 2023-24

In [4]:
# load in raw data
merged_gws_23_24 = pd.read_csv('../data/raw/2023-24/gws/merged_gw.csv')

# pick out useful subset of data
merged_gws_23_24 = merged_gws_23_24[
    [
        'element',      # player ID
        'name',
        'position',
        'GW',
        'total_points',
        'value',
        'minutes',
        'expected_goals',
        'expected_assists',
        'expected_goals_conceded',
        'goals_scored',
        'assists',
        'goals_conceded',
        'clean_sheets',
        'ict_index',
        'fixture',
        'was_home'      # we will use 'fixture' and 'was_home' to retrieve fixture difficulty
    ]
]

# look up fixture difficulty using the 'fixture' and 'was_home' columns
# (add_fixture_difficulty is defined in the 2022-23 section above)
fixtures_23_24 = pd.read_csv('../data/raw/2023-24/fixtures.csv')
fixtures_dict_23_24 = fixtures_23_24[['id', 'team_h_difficulty', 'team_a_difficulty']].set_index('id').to_dict(orient='index')

merged_gws_23_24 = merged_gws_23_24.apply(lambda row: add_fixture_difficulty(row=row, fixtures_dict=fixtures_dict_23_24), axis=1)

# save processed data
merged_gws_23_24.to_csv('../data/processed/2023-24/processed_merged_gws.csv', index=False)


## 2024-25

In [5]:
# load in raw data
merged_gws_24_25 = pd.read_csv('../data/raw/2024-25/gws/merged_gw.csv')

# pick out useful subset of data
merged_gws_24_25 = merged_gws_24_25[
    [
        'element',      # player ID
        'name',
        'position',
        'GW',
        'total_points',
        'value',
        'minutes',
        'expected_goals',
        'expected_assists',
        'expected_goals_conceded',
        'goals_scored',
        'assists',
        'goals_conceded',
        'clean_sheets',
        'ict_index',
        'fixture',
        'was_home'      # we will use 'fixture' and 'was_home' to retrieve fixture difficulty
    ]
]

# look up fixture difficulty using the 'fixture' and 'was_home' columns
# (add_fixture_difficulty is defined in the 2022-23 section above)
fixtures_24_25 = pd.read_csv('../data/raw/2024-25/fixtures.csv')
fixtures_dict_24_25 = fixtures_24_25[['id', 'team_h_difficulty', 'team_a_difficulty']].set_index('id').to_dict(orient='index')

merged_gws_24_25 = merged_gws_24_25.apply(lambda row: add_fixture_difficulty(row=row, fixtures_dict=fixtures_dict_24_25), axis=1)

# save processed data
merged_gws_24_25.to_csv('../data/processed/2024-25/processed_merged_gws.csv', index=False)


## Moving On: Model Development

Now that we have extracted the useful parts of the raw data, we can move on to developing our models.

**Model Selection**: For this project, we will develop two models, one using simple linear regression and one using XGBoost, and compare their performance.
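
To make the comparison concrete, here is a rough sketch of how two regressors could be fitted and scored side by side. Everything in it is an assumption made for illustration: the data is synthetic rather than FPL data, `compare_models` is a hypothetical helper, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so the example depends only on scikit-learn. Since `xgboost.XGBRegressor` follows the same `fit`/`predict` interface, it could be passed in its place.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and return its test-set MSE keyed by model name."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = mean_squared_error(y_test, model.predict(X_test))
    return scores


# Synthetic data (not FPL data): a noisy linear target on three features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

models = {
    'linear_regression': LinearRegression(),
    # Stand-in for XGBoost; an XGBRegressor instance could be used here instead.
    'gradient_boosting': GradientBoostingRegressor(random_state=0),
}

# Hold out the last 50 rows for evaluation.
scores = compare_models(models, X[:150], y[:150], X[150:], y[150:])
for name, mse in scores.items():
    print(f'{name}: MSE = {mse:.4f}')
```

Keeping the models in a dictionary means both are trained and evaluated on exactly the same split, so their scores are directly comparable.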

We will proceed in the following order:
- **Continued Feature Engineering**: In addition to single-gameweek data, we may want to create rolling features over recent gameweeks, such as ```total points in past 3 weeks```, ```minutes played in past 3 weeks```, or ```expected goal involvements in past 3 weeks```. These features may give better insight into player form.

- **Feature Selection**: We may want to run a correlation analysis to see which features relate most strongly to the target, and to find features that are highly correlated with each other (so we can remove redundant dimensions).

- **Feature Scaling and Encoding**: Since one of our models is linear, we will need to scale numerical features (e.g. by normalization) and encode categorical ones (e.g. by one-hot encoding), among other preprocessing steps.

- **Splitting Data for Training and Evaluation**: A key part of developing any machine learning model. We will split our data into training and testing sets (and, depending on performance needs, a validation set).

- **Model Training & Evaluation**: Finally, we will train our models and evaluate them using metrics such as mean squared error (MSE). Depending on how well the models perform, we may revisit some of the earlier steps and make improvements.
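
As a preview of the first step, the rolling "past 3 weeks" features could be sketched roughly as follows. This is a minimal sketch on toy data and assumes one row per player per gameweek; the `_last_3` column names are hypothetical. The `shift(1)` is a deliberate choice so that a gameweek's features never include that gameweek's own points.

```python
import pandas as pd

# Toy player-gameweek data using column names from the processed files.
df = pd.DataFrame({
    'element': [1, 1, 1, 1, 2, 2, 2, 2],
    'GW': [1, 2, 3, 4, 1, 2, 3, 4],
    'total_points': [2, 6, 1, 9, 0, 3, 3, 12],
    'minutes': [90, 90, 45, 90, 0, 60, 90, 90],
})

# Rolling windows only make sense on chronologically ordered rows.
df = df.sort_values(['element', 'GW']).reset_index(drop=True)

for col in ['total_points', 'minutes']:
    df[f'{col}_last_3'] = (
        df.groupby('element')[col]
          # shift(1) drops the current gameweek, so the feature for week t
          # sums up to three earlier weeks and cannot leak the target.
          .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).sum())
    )

print(df[['element', 'GW', 'total_points', 'total_points_last_3']])
```

Each player's first gameweek has no history, so its rolling values are `NaN`; those rows would need to be dropped or imputed before training.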