# Goal

The goal is to take a series of inputs for each player available for purchase in FPL and turn them into a prediction of that player's points for the gameweek. 

# What is needed?

In order to generate an expected point value for a player, we need data about players and what they scored each week. <br>

It does not seem like this sort of information is being saved anywhere. As such, the first phase of this project will be setting up the pipeline to collect this data each gameweek. We will want to collect a range of information from a few different sources: percentage of minutes played, xG per 90, xA per 90, "threat", "influence", and "creativity" (those three being FPL-generated metrics), opposition xG conceded, home or away, etc. <br>

We will want to collect this weekly as a snapshot BEFORE the matches are played. After they are played, we will append a "points_scored" value to each record. Eventually we aim to predict this points-scored value given all the data we collect, but we need the data in week-by-week format in order to do this. 
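As a rough sketch of the snapshot-then-append idea, the example below builds a toy pre-gameweek snapshot and fills in the points afterwards. The column names (`gameweek`, `player`, `xg_per90`, `is_home`), the player labels, and the `append_points` helper are all assumptions made for illustration, not part of the pipeline yet.

```python
import pandas as pd

# Hypothetical pre-gameweek snapshot: one row per player per gameweek,
# with points_scored left empty until the matches have been played.
snapshot = pd.DataFrame({
    "gameweek": [12, 12, 12],
    "player": ["Player A", "Player B", "Player C"],
    "xg_per90": [0.51, 0.03, 0.20],
    "is_home": [True, False, True],
    "points_scored": [pd.NA, pd.NA, pd.NA],
})

# Hypothetical post-gameweek results, pulled after the matches finish.
results = pd.DataFrame({
    "gameweek": [12, 12],
    "player": ["Player A", "Player C"],
    "points_scored": [9, 2],
})

def append_points(snapshot, results):
    """Fill points_scored on the snapshot, matching on (gameweek, player)."""
    key = ["gameweek", "player"]
    # Drop the empty placeholder column, then bring in the real values.
    # validate= raises if either side has duplicate keys.
    return snapshot.drop(columns="points_scored").merge(
        results[key + ["points_scored"]],
        on=key,
        how="left",
        validate="one_to_one",
    )

completed = append_points(snapshot, results)
```

Players with no result row (here `Player B`) keep a missing `points_scored`, which makes it easy to spot records that still need filling in.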

# Phase 1: Week-by-week Historical Data Collection

## 1) Data sources and desired attributes

Here I will outline the specific data sources I am going to pull from, and what data I want. 

### Fbref

Think of this site as providing data from two perspectives: team and individual. <br>

For team data, we want attributes that show how the individual's team is performing, and also how the team they are playing against is performing. Therefore, we want:

- all expected stats per 90 minutes FOR (don't even pull goals and assists, I just care about expected). We will use this to see how good of an attacking team this player is playing for, and how bad of an attacking team they are playing against
- all expected stats per 90 minutes AGAINST (tells us how good or bad of a defense this player plays for or is up against)

And for the individual perspective:

- percentage of minutes played this season - "min%" (is the player playing a lot?)
- expected stats per 90 (how effective is this player attacking-wise?)
- tackle + challenge + blocks, per 90 data (how effective is this player defensively?)
- yellow/red cards per 90 (these actions lose points, so we want to know about them)
- penalty share, a number between 0 and 1 (we want to know if a player is their team's penalty kick taker, as this is a good way to get points)
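The penalty share could be computed along these lines. This is a minimal sketch on toy data: `team` and `pk_att` stand in for the FBref team and penalty-attempt columns (for example `Performance_PKatt` in the standard table), and treating a team with no penalties so far as a share of 0 for every player is an assumption.

```python
import pandas as pd

# Toy data; values are made up for illustration.
stats = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "team": ["X", "X", "Y", "Y"],
    "pk_att": [3, 1, 0, 0],
})

def add_penalty_share(df, team_col="team", pk_col="pk_att"):
    """Add penalty_share = player's penalty attempts / team's penalty attempts."""
    out = df.copy()
    team_total = out.groupby(team_col)[pk_col].transform("sum")
    # A team with no penalties yet would give 0/0; mask the zero total
    # so the division yields NaN, then report a share of 0 instead.
    share = out[pk_col] / team_total.where(team_total > 0)
    out["penalty_share"] = share.fillna(0.0)
    return out

shares = add_penalty_share(stats)
```

Early in the season the share is based on very few penalties, so it may be worth treating it as a noisy signal.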

We will also get all the scheduling information out of this site. 

### Official fantasy premier league site

We want to know some stuff as it relates to the game itself. These include:

- price and selection %, won't really assist in predicting points (or rather we don't want to use them for that) but will come in handy for later functionality with the model, like picking differentials and building a squad
- FORM - very important. We want to know how this player is performing coming into the gameweek
- finally, actual points scored.

Remember, these are all snapshot statistics - we want to know what these values were before the gameweek, and after the gameweek, we want to append the points scored to each record. 

### Proposed workflow

1) A script runs to start to fill out the games to be played in the next gameweek. It fills in a record for each player, with the gameweek, individual's team, and opposition.

2) We then access the Fbref data source in order to get team and opposition data. Basically, we will match on the player's team first, getting expected data both for and against - then we repeat the process for the opposition.

3) Now we have the player, who they are playing, and data about how their team and their opposition are performing per 90 up to this point in the season. We should now attach all the player-perspective data to each row, i.e. all the per 90 data. This should all be quite simple except for the penalty kick share, which requires a small calculation to see what percentage of a team's penalty kicks the player has taken.

4) Now, join in the data from the official FPL website. Match on player name, grab price, % selection, and form, and add a "points_scored" column that is left BLANK (we will not know it at the time this script runs).

5) We will let the gameweek happen, then run the script that gets player points for the week from the official FPL site. Join these points onto the records we just created, using gameweek and player name as the combined key. 
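Steps 1 and 2 can be sketched as follows. The `fixtures` frame mimics the shape returned by `get_fixtures()`; the `team_stats` table, its column names and values, and both helper functions are hypothetical stand-ins, since `get_teams()` is not written yet.

```python
import pandas as pd

# Toy fixtures in the shape returned by get_fixtures().
fixtures = pd.DataFrame({
    "home_team": ["Arsenal", "Fulham"],
    "away_team": ["Liverpool", "Brentford"],
    "week": [12, 12],
})

# Hypothetical team-level table; the numbers are made up.
team_stats = pd.DataFrame({
    "team": ["Arsenal", "Liverpool", "Fulham", "Brentford"],
    "xg_for_per90": [1.9, 1.8, 1.2, 1.3],
    "xg_against_per90": [0.6, 1.1, 1.4, 1.5],
})

def fixtures_to_team_rows(fixtures):
    """Turn one row per match into one row per team, with opposition and venue."""
    home = fixtures.rename(
        columns={"home_team": "team", "away_team": "opposition"}
    ).assign(is_home=True)
    away = fixtures.rename(
        columns={"away_team": "team", "home_team": "opposition"}
    ).assign(is_home=False)
    rows = pd.concat([home, away], ignore_index=True)
    return rows[["week", "team", "opposition", "is_home"]]

def attach_team_stats(rows, team_stats):
    """Join the team table twice: once for the team, once for the opposition."""
    own = team_stats.add_prefix("team_").rename(columns={"team_team": "team"})
    opp = team_stats.add_prefix("opp_").rename(columns={"opp_team": "opposition"})
    return rows.merge(own, on="team", how="left").merge(opp, on="opposition", how="left")

team_rows = attach_team_stats(fixtures_to_team_rows(fixtures), team_stats)
```

Each player record can then pick up these columns by merging on `team`. One thing to watch: FBref team names such as `Nott'ham Forest` or `Newcastle Utd` may not match the names used by the FPL API, so a small name-mapping table may be needed before the join.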

In [73]:
import soccerdata as sd
import pandas as pd
from datetime import datetime
import requests
from thefuzz import process

def get_fixtures(week_wanted):
    """
    Grabs the list of games for the given week and returns only the home team, away team, and match week.
    """
    fbref = sd.FBref(leagues='ENG-Premier League', seasons='2025-2026')
    schedule = fbref.read_schedule()
    schedule['date'] = pd.to_datetime(schedule['date'], errors='coerce')
    schedule = schedule[schedule['week'] == week_wanted]

    return schedule[['home_team','away_team','week']]


def get_fbref_player_stats(season='2025-2026'):
    """
    grabs all player individual statistics that we want
    """
    fbref = sd.FBref('ENG-Premier League', season)

    standard = fbref.read_player_season_stats(stat_type="standard")
    shooting = fbref.read_player_season_stats(stat_type="shooting")
    passing = fbref.read_player_season_stats(stat_type="passing")
    defense = fbref.read_player_season_stats(stat_type="defense")
    playing_time = fbref.read_player_season_stats(stat_type="playing_time")

    def flatten_cols(df):
        df = df.copy()
        df.columns = ['_'.join(col).strip() if isinstance(col, tuple) else col for col in df.columns.values]
        return df

    standard = flatten_cols(standard)
    shooting = flatten_cols(shooting)
    passing = flatten_cols(passing)
    defense = flatten_cols(defense)
    playing_time = flatten_cols(playing_time)

    for df in [standard, shooting, passing, defense, playing_time]:
        df.reset_index(inplace=True)
        df.rename(columns={'index': 'player'}, inplace=True)

    metadata_cols = ['season', 'league', 'team', 'nation_', 'pos_', 'age_', 'born_']
    for df in [standard, shooting, passing, defense]:
        df.drop(columns=[c for c in metadata_cols if c in df.columns], inplace=True)

    fbref_stats = standard
    for df in [shooting, passing, defense, playing_time]:
        fbref_stats = fbref_stats.merge(df, on='player', how='outer')

    
    # Treat zero 90s played as missing so the per-90 rates become NaN rather than inf
    nineties = fbref_stats['Playing Time_90s_y'].replace(0, float('nan'))
    fbref_stats['Tackles_Tkl_per90'] = fbref_stats['Tackles_Tkl'] / nineties
    fbref_stats['Blocks_Blocks_per90'] = fbref_stats['Blocks_Blocks'] / nineties
    fbref_stats['yellow_per90'] = fbref_stats['Performance_CrdY'] / nineties
    fbref_stats['red_per90'] = fbref_stats['Performance_CrdR'] / nineties

    return fbref_stats

def get_teams():
    """
    grabs team statistics at this point in time, for each team
    """
    return None

def get_players():
    """
    Grabs a list of all FPL players
    """
    url = "https://fantasy.premierleague.com/api/bootstrap-static/"
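    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on an HTTP error instead of parsing an error page
    data = response.json()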
    
    players = pd.DataFrame(data['elements'])
    teams = {team['id']: team['name'] for team in data['teams']}
    players['team_name'] = players['team'].map(teams)
    
    positions = {pos['id']: pos['singular_name'] for pos in data['element_types']}
    players['position'] = players['element_type'].map(positions)
    
    # .copy() avoids pandas' SettingWithCopyWarning when full_name is added below
    players_df = players[['id', 'first_name', 'second_name', 'team_name', 'position', 'now_cost']].copy()
    players_df['full_name'] = players_df['first_name'] + " " + players_df['second_name']

    return players_df

def fuzzy_match(fpl_df, fbref_df, threshold=90):
    """
    Fuzzy matches FPL players to FBref player stats by name
    """

    fbref_names = fbref_df['player'].tolist()
    fpl_names = fpl_df['full_name'].tolist()

    mapping = {}
    for name in fpl_names:
        match, score = process.extractOne(name, fbref_names)
        if score >= threshold:
            mapping[name] = match
        else:
            mapping[name] = None

    fpl_df = fpl_df.copy()  # avoid mutating the caller's DataFrame
    fpl_df['fbref_name'] = fpl_df['full_name'].map(mapping)

    merged = fpl_df.merge(fbref_df, left_on='fbref_name', right_on='player', how='left')

    return merged


In [None]:
# TODO

# get_teams() to get team stats for players

# and tidy up the fuzzy match below ..
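One possible direction for tidying up the match is sketched below. The output further down shows long FPL names such as `Gabriel dos Santos Magalhães` getting no match at a threshold of 90, so this sketch strips accents and also scores how many name tokens two names share. It uses the standard library's `difflib` in place of `thefuzz` so that it runs without extra dependencies; the helper names, the 0.8 threshold, and the candidate names in the usage lines are all assumptions.

```python
import difflib
import unicodedata

def normalize(name):
    """Lowercase and strip accents, e.g. 'Gyökeres' -> 'gyokeres'."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().strip()

def token_overlap(a, b):
    """Fraction of the shorter name's tokens that also appear in the longer one."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    shorter = min(len(ta), len(tb))
    # Single-token names are too ambiguous to trust on overlap alone.
    if shorter < 2:
        return 0.0
    return len(ta & tb) / shorter

def best_match(fpl_name, candidates, threshold=0.8):
    """Return (candidate, score), or (None, score) if nothing clears the threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        ratio = difflib.SequenceMatcher(
            None, normalize(fpl_name), normalize(cand)
        ).ratio()
        score = max(ratio, token_overlap(fpl_name, cand))
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score

# Hypothetical candidate list standing in for fbref_df['player'].
candidates = ["Gabriel Magalhães", "Gabriel Jesus", "David Raya"]
match, score = best_match("Gabriel dos Santos Magalhães", candidates)
```

Restricting the candidates to players on the same team before matching would likely cut down false positives further, at the cost of needing consistent team names across both sources.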

In [74]:
df_fpl = get_players()
df_fbref = get_fbref_player_stats()

df = fuzzy_match(df_fpl,df_fbref)

In [77]:
df.head(50)

Unnamed: 0,id,first_name,second_name,team_name,position,now_cost,full_name,fbref_name,player,Playing Time_MP_x,Playing Time_Starts,Playing Time_Min_x,Playing Time_90s_x,Performance_Gls,Performance_Ast,Performance_G+A,Performance_G-PK,Performance_PK,Performance_PKatt,Performance_CrdY,Performance_CrdR,Expected_xG_x,Expected_npxG_x,Expected_xAG,Expected_npxG+xAG,Progression_PrgC,Progression_PrgP,Progression_PrgR,Per 90 Minutes_Gls,Per 90 Minutes_Ast,Per 90 Minutes_G+A,Per 90 Minutes_G-PK,Per 90 Minutes_G+A-PK,Per 90 Minutes_xG,Per 90 Minutes_xAG,Per 90 Minutes_xG+xAG,Per 90 Minutes_npxG,Per 90 Minutes_npxG+xAG,90s__x,Standard_Gls,Standard_Sh,Standard_SoT,Standard_SoT%,Standard_Sh/90,Standard_SoT/90,Standard_G/Sh,Standard_G/SoT,Standard_Dist,Standard_FK,Standard_PK,Standard_PKatt,Expected_xG_y,Expected_npxG_y,Expected_npxG/Sh,Expected_G-xG,Expected_np:G-xG,90s__y,Total_Cmp,Total_Att,Total_Cmp%,Total_TotDist,Total_PrgDist,Short_Cmp,Short_Att,Short_Cmp%,Medium_Cmp,Medium_Att,Medium_Cmp%,Long_Cmp,Long_Att,Long_Cmp%,Ast_,xAG_,Expected_xA,Expected_A-xAG,KP_,1/3_,PPA_,CrsPA_,PrgP_,90s_,Tackles_Tkl,Tackles_TklW,Tackles_Def 3rd,Tackles_Mid 3rd,Tackles_Att 3rd,Challenges_Tkl,Challenges_Att,Challenges_Tkl%,Challenges_Lost,Blocks_Blocks,Blocks_Sh,Blocks_Pass,Int_,Tkl+Int_,Clr_,Err_,league,season,team,nation_,pos_,age_,born_,Playing Time_MP_y,Playing Time_Min_y,Playing Time_Mn/MP,Playing Time_Min%,Playing Time_90s_y,Starts_Starts,Starts_Mn/Start,Starts_Compl,Subs_Subs,Subs_Mn/Sub,Subs_unSub,Team Success_PPM,Team Success_onG,Team Success_onGA,Team Success_+/-,Team Success_+/-90,Team Success_On-Off,Team Success (xG)_onxG,Team Success (xG)_onxGA,Team Success (xG)_xG+/-,Team Success (xG)_xG+/-90,Team Success (xG)_On-Off,Tackles_Tkl_per90,Blocks_Blocks_per90,yellow_per90,red_per90
0,1,David,Raya Martín,Arsenal,Goalkeeper,59,David Raya Martín,David Raya,David Raya,11.0,11.0,990.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.1,0.1,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,0.0,0.0,0.0,,0.0,0.0,,,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,11.0,264.0,380.0,69.5,7295.0,5298.0,56.0,56.0,100.0,134.0,136.0,98.5,74.0,188.0,39.4,0.0,0.1,0.1,-0.1,1.0,26.0,1.0,0.0,4.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,7.0,1.0,ENG-Premier League,2526.0,Arsenal,ESP,GK,30-057,1995.0,11.0,990.0,90.0,100.0,11.0,11.0,90.0,11.0,0.0,,0.0,2.36,20.0,5.0,15.0,1.36,,18.8,6.0,12.8,1.16,,0.0,0.0,0.090909,0.0
1,2,Kepa,Arrizabalaga Revuelta,Arsenal,Goalkeeper,42,Kepa Arrizabalaga Revuelta,Kepa Arrizabalaga,Kepa Arrizabalaga,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ENG-Premier League,2526.0,Arsenal,ESP,GK,31-039,1994.0,0.0,,,,,0.0,,0.0,0.0,,11.0,,,,,,,,,,,,,,,
2,3,Karl,Hein,Arsenal,Goalkeeper,40,Karl Hein,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,4,Tommy,Setford,Arsenal,Goalkeeper,40,Tommy Setford,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,5,Gabriel,dos Santos Magalhães,Arsenal,Defender,66,Gabriel dos Santos Magalhães,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,6,William,Saliba,Arsenal,Defender,60,William Saliba,William Saliba,William Saliba,10.0,9.0,724.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3,0.3,0.1,0.3,4.0,36.0,3.0,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.04,0.03,0.04,8.0,0.0,2.0,0.0,0.0,0.25,0.0,0.0,,10.4,0.0,0.0,0.0,0.3,0.3,0.13,-0.3,-0.3,8.0,669.0,711.0,94.1,11559.0,3293.0,281.0,287.0,97.9,350.0,366.0,95.6,32.0,45.0,71.1,0.0,0.1,0.4,-0.1,1.0,35.0,1.0,0.0,36.0,8.0,10.0,4.0,6.0,4.0,0.0,4.0,6.0,66.7,2.0,4.0,4.0,0.0,3.0,13.0,39.0,0.0,ENG-Premier League,2526.0,Arsenal,FRA,DF,24-232,2001.0,10.0,724.0,72.0,73.1,8.0,9.0,75.0,7.0,1.0,45.0,0.0,2.3,17.0,3.0,14.0,1.74,1.4,14.8,4.5,10.4,1.29,0.47,1.25,0.5,0.0,0.0
6,7,Riccardo,Calafiori,Arsenal,Defender,58,Riccardo Calafiori,Riccardo Calafiori,Riccardo Calafiori,11.0,11.0,856.0,9.5,1.0,2.0,3.0,1.0,0.0,0.0,3.0,0.0,2.2,2.2,0.5,2.7,19.0,33.0,40.0,0.11,0.21,0.32,0.11,0.32,0.23,0.05,0.28,0.23,0.28,9.5,1.0,19.0,3.0,15.8,2.0,0.32,0.05,0.33,14.4,0.0,0.0,0.0,2.2,2.2,0.12,-1.2,-1.2,9.5,385.0,474.0,81.2,6237.0,1289.0,190.0,211.0,90.0,167.0,198.0,84.3,18.0,42.0,42.9,2.0,0.5,0.2,1.5,4.0,27.0,6.0,3.0,33.0,9.5,13.0,9.0,7.0,3.0,3.0,8.0,12.0,66.7,4.0,7.0,1.0,6.0,5.0,18.0,31.0,0.0,ENG-Premier League,2526.0,Arsenal,ITA,DF,23-176,2002.0,11.0,856.0,78.0,86.5,9.5,11.0,78.0,4.0,0.0,,0.0,2.36,16.0,5.0,11.0,1.16,-1.53,16.9,4.7,12.2,1.28,0.89,1.368421,0.736842,0.315789,0.0
7,8,Jurriën,Timber,Arsenal,Defender,61,Jurriën Timber,Jurriën Timber,Jurriën Timber,11.0,10.0,882.0,9.8,2.0,1.0,3.0,2.0,0.0,0.0,2.0,0.0,2.0,2.0,0.6,2.5,16.0,59.0,47.0,0.2,0.1,0.31,0.2,0.31,0.2,0.06,0.26,0.2,0.26,9.8,2.0,11.0,5.0,45.5,1.12,0.51,0.18,0.4,9.1,0.0,0.0,0.0,2.0,2.0,0.19,0.0,0.0,9.8,413.0,492.0,83.9,6487.0,2024.0,194.0,215.0,90.2,203.0,232.0,87.5,10.0,24.0,41.7,1.0,0.6,0.7,0.4,11.0,46.0,14.0,1.0,59.0,9.8,34.0,18.0,16.0,11.0,7.0,11.0,15.0,73.3,4.0,9.0,0.0,9.0,6.0,40.0,26.0,0.0,ENG-Premier League,2526.0,Arsenal,NED,DF,24-147,2001.0,11.0,882.0,80.0,89.1,9.8,10.0,86.0,8.0,1.0,20.0,0.0,2.36,17.0,5.0,12.0,1.22,-1.28,16.6,4.9,11.7,1.19,0.28,3.469388,0.918367,0.204082,0.0
8,9,Jakub,Kiwior,Arsenal,Defender,54,Jakub Kiwior,Jakub Kiwior,Jakub Kiwior,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ENG-Premier League,2526.0,Arsenal,POL,"DF,MF",25-269,2000.0,0.0,,,,,0.0,,0.0,0.0,,1.0,,,,,,,,,,,,,,,
9,10,Myles,Lewis-Skelly,Arsenal,Defender,51,Myles Lewis-Skelly,Myles Lewis-Skelly,Myles Lewis-Skelly,7.0,0.0,92.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,,0.0,0.0,,,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,1.0,56.0,62.0,90.3,954.0,126.0,26.0,26.0,100.0,27.0,30.0,90.0,3.0,4.0,75.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,100.0,0.0,2.0,0.0,2.0,1.0,2.0,3.0,0.0,ENG-Premier League,2526.0,Arsenal,ENG,DF,19-046,2006.0,7.0,92.0,13.0,9.3,1.0,0.0,,0.0,7.0,13.0,4.0,3.0,3.0,0.0,3.0,2.93,1.73,1.6,1.0,0.7,0.66,-0.56,1.0,2.0,1.0,0.0


In [67]:
df = df[['player','Playing Time_Min%','Per 90 Minutes_xG','Per 90 Minutes_xAG','Tackles_Tkl_per90','Blocks_Blocks_per90','yellow_per90','red_per90']]

In [68]:
df.head()

Unnamed: 0,player,Playing Time_Min%,Per 90 Minutes_xG,Per 90 Minutes_xAG,Tackles_Tkl_per90,Blocks_Blocks_per90,yellow_per90,red_per90
599,Viktor Gyökeres,80.8,0.51,0.12,0.224719,0.561798,0.11236,0.0


In [None]:
fixtures = get_fixtures(12)

In [19]:
schedule = schedule.sort_values(by='date',ascending=False)
schedule.head()

league,season,game,week,day,date,time,home_team,home_xg,score,away_xg,away_team,attendance,venue,referee,match_report,notes,game_id
ENG-Premier League,2526,2026-05-24 West Ham-Leeds United,38,Sun,2026-05-24,16:00,West Ham,,,,Leeds United,,London Stadium,,,,
ENG-Premier League,2526,2026-05-24 Nott'ham Forest-Bournemouth,38,Sun,2026-05-24,16:00,Nott'ham Forest,,,,Bournemouth,,The City Ground,,,,
ENG-Premier League,2526,2026-05-24 Manchester City-Aston Villa,38,Sun,2026-05-24,16:00,Manchester City,,,,Aston Villa,,Etihad Stadium,,,,
ENG-Premier League,2526,2026-05-24 Liverpool-Brentford,38,Sun,2026-05-24,16:00,Liverpool,,,,Brentford,,Anfield,,,,
ENG-Premier League,2526,2026-05-24 Fulham-Newcastle Utd,38,Sun,2026-05-24,16:00,Fulham,,,,Newcastle Utd,,Craven Cottage,,,,
