# Goal

The goal is to take in a series of inputs for each player available for purchase in FPL -> and turn that into a prediction for their points for the gameweek. 

# What is needed?

In order to generate an expected point value for a player, we need data about players and what they scored each week. <br>

It does not seem like this sort of information is being saved anywhere. As such, the first phase of this project will be setting up the pipeline to collect this data each gameweek. We will want to collect a bunch of information from a few different sources, things like percentage of minutes played, xG Per 90, xA per 90, "threat", "influence, "creativity (those 3 being FPL generated metrics), opposition xG conceded, home or away, etc. <br>

We will want to be able to collect this weekly as a snapshot BEFORE the matches are played. After they are played, we will append a "points_scored" to each record. Eventually we aim to be able to predict this points scored value give all the data we collect, but we need the data in the week-by-week format in order to do this. 

# Phase 1: Week-by-week Historical Data Collection

## 1) Data sources and desired attributes

Here I will outline the specific data sources I am going to pull from, and what data I want. 

### Fbref

Think of this site as providing data from two perspectives: team and individual. <br>

As for team data, we want to have attributes that give an idea of how the individual's team is performing, but also how the team they are playing against is performing. Therefore..

- all expected stats per 90 minutes FOR (don't even pull goals and assists, I just care about expected). We will use this to see how good of an attacking team this player is playing for, and how bad of an attacking team they are playing against
- all expected stats per 90 minutes AGAINST (tells us how good or bad of a defense this player plays for or is up against)

And for the individual perspective:

- percentage of minutes played this season - "min%" (is the player playing a lot?)
- expected stats per 90 (how effective is this player attacking-wise?)
- tackle + challenge + blocks, per 90 data (how effective is this player defensively?)
- yellow/ red cards per 90 (these actions lose points, so we want to know about them)
- penalty share, a number between 0 and 1 (we want to know if a player is their team's penalty kick taker, as this is a good way to get points)

We will also get all the scheduling information out of this site. 

### Official fantasy premier league site

We want to know some stuff as it relates to the game itself. These include:

- price and selection %, won't really assist in predicting points (or rather we don't want to use them for that) but will come in handy for later functionality with the model, like picking differentials and building a squad
- FORM - very important. We want to know how this player is performing coming into the gameweek
- finally, actual points scored.

Remember, these are all snapshot statistics - we want to know what these values were before the gameweek, and after the gameweek, we want to append the points scored to each record. 

### Proposed workflow

1) A script runs to start to fill out the games to be played in the next gameweek. It fills in a record for each player, with the gameweek, individual's team, and opposition.

2) We then access the Fbref data source in order to get team and opposition data. Basically, we will match on the player's team first, getting expected data both for and against - then we repeat the process for the opposition.

3) Now, we have the player, who they are playing, and data about how their team is performing per 90 and how their opposition is performing per 90 up to this point in the season. We should now attach all the data from the player perspective to each row. Get all the per 90 data. This should all be quite simple except for the penalty kick share, which will require a simple calculation to see what perfecntage of a team's penalty kicks the player has taken.

4) Now, join in the data from the official FPL website. Match based on player name, and grab price, % selection, form, and the column "points_scored" but leave this BLANK (we will not know it at the time this script runs).

5) We will let the game week happen, then run the script that gets player points for the week from the official FPL site. Join this in based on player name to the records we just created, using matchweek and player name as the combined key. 

In [102]:
import soccerdata as sd
import pandas as pd
from datetime import datetime
import requests
from thefuzz import process

def get_fixtures(week_wanted):
    """
    grabs the list of games for the week, extracts only the cleaned team names of home and away team, as well as match_week, 
    """
    fbref = sd.FBref(leagues='ENG-Premier League', seasons='2025-2026')
    schedule = fbref.read_schedule()
    schedule['date'] = pd.to_datetime(schedule['date'], errors='coerce')
    schedule = schedule[schedule['week'] == week_wanted]

    return schedule[['home_team','away_team','week']]


def get_fbref_player_stats(season='2025-2026',pt_threshold=40):
    """
    grabs all player individual statistics that we want
    """
    fbref = sd.FBref('ENG-Premier League', season)

    standard = fbref.read_player_season_stats(stat_type="standard")
    shooting = fbref.read_player_season_stats(stat_type="shooting")
    passing = fbref.read_player_season_stats(stat_type="passing")
    defense = fbref.read_player_season_stats(stat_type="defense")
    playing_time = fbref.read_player_season_stats(stat_type="playing_time")

    def flatten_cols(df):
        df = df.copy()
        df.columns = ['_'.join(col).strip() if isinstance(col, tuple) else col for col in df.columns.values]
        return df

    standard = flatten_cols(standard)
    shooting = flatten_cols(shooting)
    passing = flatten_cols(passing)
    defense = flatten_cols(defense)
    playing_time = flatten_cols(playing_time)

    for df in [standard, shooting, passing, defense, playing_time]:
        df.reset_index(inplace=True)
        df.rename(columns={'index': 'player'}, inplace=True)

    metadata_cols = ['season', 'league', 'team', 'nation_', 'pos_', 'age_', 'born_']
    for df in [standard, shooting, passing, defense]:
        df.drop(columns=[c for c in metadata_cols if c in df.columns], inplace=True)

    fbref_stats = standard
    for df in [shooting, passing, defense, playing_time]:
        fbref_stats = fbref_stats.merge(df, on='player', how='outer')

    
    fbref_stats['Tackles_Tkl_per90'] = fbref_stats['Tackles_Tkl'] / fbref_stats['Playing Time_90s_y']
    fbref_stats['Blocks_Blocks_per90'] = fbref_stats['Blocks_Blocks'] / fbref_stats['Playing Time_90s_y']
    fbref_stats['yellow_per90'] = fbref_stats['Performance_CrdY'] / fbref_stats['Playing Time_90s_y']
    fbref_stats['red_per90'] = fbref_stats['Performance_CrdR'] / fbref_stats['Playing Time_90s_y']

    fbref_stats = fbref_stats[fbref_stats['Playing Time_Min%'] >= pt_threshold]

    return fbref_stats[['player','Playing Time_Min%','Per 90 Minutes_xG','Per 90 Minutes_xAG','Tackles_Tkl_per90','Blocks_Blocks_per90','yellow_per90','red_per90']]
    
def get_teams():
    """
    grabs team statistics at this point in time, for each team
    """
    return None

def get_players():
    """
    Grabs a list of all FPL players
    """
    url = "https://fantasy.premierleague.com/api/bootstrap-static/"
    response = requests.get(url)
    data = response.json()
    
    players = pd.DataFrame(data['elements'])
    teams = {team['id']: team['name'] for team in data['teams']}
    players['team_name'] = players['team'].map(teams)
    
    positions = {pos['id']: pos['singular_name'] for pos in data['element_types']}
    players['position'] = players['element_type'].map(positions)
    
    players_df = players[['id', 'first_name', 'second_name', 'team_name', 'position', 'now_cost']].copy()
    players_df['full_name'] = players_df['first_name'] + " " + players_df['second_name']

    # --- Name normalization map ---
    name_map = {
        "Alisson": "Alisson Becker",
        "André": "André Trindade da Costa Neto",
        "Benjamin Šeško": "Benjamin Sesko",
        "Bernardo Silva": "Bernardo Mota Veiga de Carvalho e Silva",
        "Beto": "Norberto Bercique Gomes Betuncal",
        "Bruno Guimarães": "Bruno Guimarães Rodriguez Moura",
        "Casemiro": "Carlos Henrique Casimiro",
        "David Raya": "David Raya Martín",
        "Diego Gómez": "Diego Gómez Amarilla",
        "Diogo Dalot": "Diogo Dalot Teixeira",
        "Emi Buendía": "Emiliano Buendía Stati",
        "Evanilson": "Francisco Evanilson de Lima Barbosa",
        "Ezri Konsa": "Ezri Konsa Ngoyo",
        "Ferdi Kadioglu": "Ferdi Kadıoğlu",
        "Florentino Luís": "Florentino Ibrain Morris Luís",
        "Gabriel Magalhães": "Gabriel dos Santos Magalhães",
        "Hugo Bueno": "Hugo Bueno López",
        "Jeremy Doku": "Jérémy Doku",
        "Joelinton": "Joelinton Cássio Apolinário de Lira",
        "Joshua King": "Josh King",
        "João Gomes": "Gustavo Nunes Fernandes Gomes",
        "João Palhinha": "João Maria Lobo Alves Palhares Costa Palhinha Gonçalves",
        "João Pedro": "João Pedro Junqueira de Jesus",
        "Lucas Paquetá": "Lucas Tolentino Coelho de Lima",  # actual full name
        "Lucas Perri": "Lucas Estella Perri",
        "Marc Cucurella": "Marc Cucurella Saseta",
        "Mateus Fernandes": "Mateus Gonçalo Espanha Fernandes",
        "Matheus Cunha": "Matheus Santos Carneiro da Cunha",
        "Max Kilman": "Maximilian Kilman",
        "Moisés Caicedo": "Moisés Caicedo Corozo",
        "Morato": "Felipe Rodrigues Da Silva",
        "Murillo": "Murillo Costa dos Santos",
        "Nicolás González": "Nico González Iglesias",
        "Pedro Neto": "Pedro Lomba Neto",
        "Pedro Porro": "Pedro Porro Sauceda",
        "Raúl Jiménez": "Raúl Jiménez Rodríguez",
        "Richarlison": "Richarlison de Andrade",
        "Rúben Dias": "Rúben dos Santos Gato Alves Dias",
        "Santiago Bueno": "Santiago Ignacio Bueno",
        "Thiago": "Igor Thiago Nascimento Rodrigues",
        "Valentino Livramento": "Tino Livramento",
        "Yeremi Pino": "Yéremy Pino Santos",
        "Álex Jiménez": "Álex Jiménez Sánchez"
    }

    # --- Apply mapping ---
    players_df['fbref_name'] = players_df['full_name'].apply(lambda x: name_map.get(x, x))
    
    return players_df


def fuzzy_match(fpl_df, fbref_df, threshold=92):
    """
    Fuzzy matches FPL players (already mapped) to FBref player stats by name
    """
    
    fbref_names = fbref_df['player'].tolist()
    fpl_names = fpl_df['fbref_name'].tolist()

    mapping = {}
    for name in fpl_names:
        if pd.isna(name):
            mapping[name] = None
            continue
        match, score = process.extractOne(name, fbref_names)
        mapping[name] = match if score >= threshold else None

    fpl_df['matched_fbref'] = fpl_df['fbref_name'].map(mapping)

    merged = fpl_df.merge(fbref_df, left_on='matched_fbref', right_on='player', how='left')

    return merged


def join_it_all_together():
    return None


In [103]:
# TODO

# get_teams() to get team stats for players

# and tidy up the fuzzy match below ..

In [104]:
df_fpl = get_players()
df_fbref = get_fbref_player_stats()

df = fuzzy_match(df_fpl,df_fbref)

In [105]:
df.head(15)

Unnamed: 0,id,first_name,second_name,team_name,position,now_cost,full_name,fbref_name,matched_fbref,player,Playing Time_Min%,Per 90 Minutes_xG,Per 90 Minutes_xAG,Tackles_Tkl_per90,Blocks_Blocks_per90,yellow_per90,red_per90
0,1,David,Raya Martín,Arsenal,Goalkeeper,59,David Raya Martín,David Raya Martín,,,,,,,,,
1,2,Kepa,Arrizabalaga Revuelta,Arsenal,Goalkeeper,42,Kepa Arrizabalaga Revuelta,Kepa Arrizabalaga Revuelta,,,,,,,,,
2,3,Karl,Hein,Arsenal,Goalkeeper,40,Karl Hein,Karl Hein,,,,,,,,,
3,4,Tommy,Setford,Arsenal,Goalkeeper,40,Tommy Setford,Tommy Setford,,,,,,,,,
4,5,Gabriel,dos Santos Magalhães,Arsenal,Defender,66,Gabriel dos Santos Magalhães,Gabriel dos Santos Magalhães,,,,,,,,,
5,6,William,Saliba,Arsenal,Defender,60,William Saliba,William Saliba,William Saliba,William Saliba,73.1,0.03,0.01,1.25,0.5,0.0,0.0
6,7,Riccardo,Calafiori,Arsenal,Defender,58,Riccardo Calafiori,Riccardo Calafiori,Riccardo Calafiori,Riccardo Calafiori,86.5,0.23,0.05,1.368421,0.736842,0.315789,0.0
7,8,Jurriën,Timber,Arsenal,Defender,61,Jurriën Timber,Jurriën Timber,Jurriën Timber,Jurriën Timber,89.1,0.2,0.06,3.469388,0.918367,0.204082,0.0
8,9,Jakub,Kiwior,Arsenal,Defender,54,Jakub Kiwior,Jakub Kiwior,,,,,,,,,
9,10,Myles,Lewis-Skelly,Arsenal,Defender,51,Myles Lewis-Skelly,Myles Lewis-Skelly,,,,,,,,,


In [107]:
df['fbref_name'].isna().sum()


np.int64(0)

In [60]:
df_a = get_players()
df_a = df_a.sort_vales(by='full_name')
df_a.head()

Unnamed: 0,id,first_name,second_name,team_name,position,now_cost,full_name
0,1,David,Raya Martín,Arsenal,Goalkeeper,59,David Raya Martín
1,2,Kepa,Arrizabalaga Revuelta,Arsenal,Goalkeeper,42,Kepa Arrizabalaga Revuelta
2,3,Karl,Hein,Arsenal,Goalkeeper,40,Karl Hein
3,4,Tommy,Setford,Arsenal,Goalkeeper,40,Tommy Setford
4,5,Gabriel,dos Santos Magalhães,Arsenal,Defender,66,Gabriel dos Santos Magalhães


In [75]:
from thefuzz import process

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)  # optional: show all rows
pd.set_option('display.max_columns', None)  # optional: show all columns


def find_best_fbref_matches(df_nones, fbref_names, limit=10, threshold=40):
    """
    For each missing fbref_name, find the top fuzzy matches from the list of FBref player names.
    """
    suggestions = []
    
    for name in df_nones['player']:
        matches = process.extract(name, fbref_names, limit=limit)
        # Keep only matches above the threshold
        good_matches = [m for m in matches if m[1] >= threshold]
        
        suggestions.append({
            'player': name,
            'suggested_matches': good_matches
        })
    
    return pd.DataFrame(suggestions)

# Example usage:
fbref_names = list(df_a['full_name'].unique())
suggestions_df = find_best_fbref_matches(df_nones, fbref_names)

suggestions_df.head(50)


Unnamed: 0,player,suggested_matches
0,Alisson,"[(Alisson Becker, 90), (Richarlison de Andrade, 77), (Hákon Rafn Valdimarsson, 69), (Reiss Nelson, 65), (Harry Wilson, 65), (Gabriel Gudmundsson, 65), (Jack Harrison, 65), (Kieran Morrison, 65), (Elyh Harrison, 65), (James Maddison, 65)]"
1,André,"[(Leandro Trossard, 90), (Andre Harriman-Annous, 90), (Andrés García, 90), (Andrew Moran, 90), (Andrey Nascimento dos Santos, 90), (Kendry Páez Andrade, 90), (Alejandro Garnacho Ferreyra, 90), (Andreas Hoelgebaum Pereira, 90), (Andrew Robertson, 90), (André Onana, 90)]"
2,Benjamin Šeško,"[(Benjamin Sesko, 86), (Benjamin Lecomte, 73), (Benjamin White, 71), (Benjamin Fredrick, 71), (Benjamin Arthur, 69), (Jake O'Brien, 51), (Bernd Leno, 50), (Ben Davies, 50), (Yang Min-hyeok, 50), (Son Heung-min, 49)]"
3,Bernardo Silva,"[(Gabriel Martinelli Silva, 86), (Bernardo Mota Veiga de Carvalho e Silva, 86), (Felipe Rodrigues da Silva, 86), (João Pedro Ferreira da Silva, 86), (Eric da Silva Moreira, 86), (Luis Eduardo Soares da Silva, 86), (Luís Hemir Silva Semedo, 86), (João Victor Gomes da Silva, 86), (Borna Sosa, 58), (Bernd Leno, 58)]"
4,Beto,"[(Norberto Murara Neto, 77), (Julio Soler Barreto, 77), (Pedro Lomba Neto, 77), (André Trindade da Costa Neto, 77), (Marcus Bettinelli, 73), (Tommy Setford, 68), (Albert Sambi Lokonga, 68), (Connor Roberts, 68), (Đorđe Petrović, 68), (Julian Eyestone, 68)]"
5,Bruno Guimarães,"[(Bruno Guimarães Rodriguez Moura, 90), (Bruno Borges Fernandes, 86), (Marc Guéhi, 60), (Rio Ngumoha, 56), (Brajan Gruda, 54), (Rio Cardines, 54), (Jacob Bruun Larsen, 53), (Marcos Senesi Barón, 53), (Bashir Humphreys, 53), (Manuel Ugarte Ribeiro, 53)]"
6,Casemiro,"[(Carlos Henrique Casimiro, 79), (Matheus Santos Carneiro da Cunha, 68), (Aarón Anselmino, 60), (Darwin Núñez Ribeiro, 60), (Manuel Ugarte Ribeiro, 60), (Nathan Fraser, 60), (Kaelan Casey, 57), (Lucas Pires Silva, 56), (Emile Smith Rowe, 56), (Antoñito Cordero Campillo, 56)]"
7,David Raya,"[(David Raya Martín, 90), (David Mota Veiga Teixeira do Carmo, 86), (David Møller Wolfe, 86), (David Brooks, 64), (David Ozoh, 63), (Calvin Ramsay, 61), (Ryan McAidoo, 60), (Yasin Ayari, 57), (Archie Gray, 57), (Rayan Aït-Nouri, 55)]"
8,Diego Gómez,"[(Diego Gómez Amarilla, 90), (Diego León Blanco, 86), (Diego Coppola, 63), (Joe Gomez, 63), (Dermot Mee, 60), (Gustavo Nunes Fernandes Gomes, 57), (Rodrigo Muniz Carvalho, 57), (Rodrigo Martins Gomes, 57), (Diogo Dalot Teixeira, 56), (Daniel James, 55)]"
9,Diogo Dalot,"[(Diogo Dalot Teixeira, 90), (Rodrigo Martins Gomes, 57), (Amad Diallo, 55), (João Victor Gomes da Silva, 53), (Igor Thiago Nascimento Rodrigues, 53), (Diego Gómez Amarilla, 53), (Gonçalo Manuel Ganchinho Guedes, 53), (Diego León Blanco, 52), (Jamaldeen Jimoh-Aloba, 51), (Igor Julio dos Santos de Paulo, 51)]"


In [59]:
df_nones = df[df['fbref_name'].isna()]
df_nones.head(50)

Unnamed: 0,player,Playing Time_Min%,Per 90 Minutes_xG,Per 90 Minutes_xAG,Tackles_Tkl_per90,Blocks_Blocks_per90,yellow_per90,red_per90,id,first_name,second_name,team_name,position,now_cost,full_name,fbref_name
6,Alisson,54.5,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,
11,André,71.2,0.0,0.01,2.307692,1.410256,0.384615,0.0,,,,,,,,
19,Benjamin Šeško,62.7,0.3,0.05,0.434783,0.724638,0.144928,0.0,,,,,,,,
20,Bernardo Silva,65.1,0.01,0.27,1.25,0.694444,0.277778,0.0,,,,,,,,
22,Beto,61.5,0.61,0.06,0.882353,0.588235,0.147059,0.0,,,,,,,,
27,Bruno Guimarães,90.3,0.16,0.06,2.222222,0.30303,0.30303,0.0,,,,,,,,
34,Casemiro,62.0,0.2,0.1,3.235294,1.764706,0.735294,0.147059,,,,,,,,
50,David Raya,100.0,0.0,0.0,0.0,0.0,0.090909,0.0,,,,,,,,
53,Diego Gómez,54.6,0.3,0.15,4.5,1.833333,0.333333,0.0,,,,,,,,
54,Diogo Dalot,60.0,0.01,0.16,1.969697,0.454545,0.0,0.0,,,,,,,,


In [None]:
fixtures = get_fixtures(12)

In [19]:
schedule = schedule.sort_values(by='date',ascending=False)
schedule.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,week,day,date,time,home_team,home_xg,score,away_xg,away_team,attendance,venue,referee,match_report,notes,game_id
league,season,game,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
ENG-Premier League,2526,2026-05-24 West Ham-Leeds United,38,Sun,2026-05-24,16:00,West Ham,,,,Leeds United,,London Stadium,,,,
ENG-Premier League,2526,2026-05-24 Nott'ham Forest-Bournemouth,38,Sun,2026-05-24,16:00,Nott'ham Forest,,,,Bournemouth,,The City Ground,,,,
ENG-Premier League,2526,2026-05-24 Manchester City-Aston Villa,38,Sun,2026-05-24,16:00,Manchester City,,,,Aston Villa,,Etihad Stadium,,,,
ENG-Premier League,2526,2026-05-24 Liverpool-Brentford,38,Sun,2026-05-24,16:00,Liverpool,,,,Brentford,,Anfield,,,,
ENG-Premier League,2526,2026-05-24 Fulham-Newcastle Utd,38,Sun,2026-05-24,16:00,Fulham,,,,Newcastle Utd,,Craven Cottage,,,,
