In [12]:
import soccerdata as sd
import pandas as pd
from datetime import datetime

# Goal

The goal is to take a set of inputs for each player available for purchase in FPL and turn them into a prediction of that player's points for the gameweek.

# What is needed?

In order to generate an expected point value for a player, we need data about players and what they scored each week. <br>

It does not seem like this sort of information is being saved anywhere. As such, the first phase of this project will be setting up the pipeline to collect this data each gameweek. We will want to collect information from a few different sources: percentage of minutes played, xG per 90, xA per 90, "threat", "influence", and "creativity" (those three being FPL-generated metrics), opposition xG conceded, home or away, etc. <br>

We will want to collect this weekly as a snapshot BEFORE the matches are played. After they are played, we will append a "points_scored" value to each record. Eventually we aim to predict this points scored value given all the data we collect, but we need the data in the week-by-week format in order to do this.

# Phase 1: Week-by-week Historical Data Collection

## 1) Data sources and desired attributes

Here I will outline the specific data sources I am going to pull from, and what data I want. 

### Fbref

Think of this site as providing data from two perspectives: team and individual. <br>

As for team data, we want attributes that give an idea of how the individual's team is performing, but also how the team they are playing against is performing. Therefore, we want:

- all expected stats per 90 minutes FOR (don't even pull goals and assists, I only care about expected). We will use this to see how good an attacking team this player plays for, and how bad an attacking team they are playing against
- all expected stats per 90 minutes AGAINST (tells us how good or bad a defense this player plays for or is up against)
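As a rough sketch of how the FOR and AGAINST numbers could end up on each row, the same team table can be merged twice: once keyed on the player's team and once on the opposition. The table contents and column names (`xg_for_per90`, `xg_against_per90`, `opponent`) are made up for illustration and are not FBref's actual column names.

```python
import pandas as pd

# Hypothetical per-team table; values and column names are illustrative only.
team_stats = pd.DataFrame({
    "team": ["Arsenal", "Fulham"],
    "xg_for_per90": [1.9, 1.2],
    "xg_against_per90": [0.8, 1.5],
})

# One row per (team, opponent) pairing for the gameweek.
pairings = pd.DataFrame({
    "team": ["Arsenal", "Fulham"],
    "opponent": ["Fulham", "Arsenal"],
})

def attach_team_context(pairings, team_stats):
    """Attach the team's own stats and the opposition's stats to each pairing."""
    # Prefix every stat column, then restore the join key's name.
    own = team_stats.add_prefix("team_").rename(columns={"team_team": "team"})
    opp = team_stats.add_prefix("opp_").rename(columns={"opp_team": "opponent"})
    out = pairings.merge(own, on="team", how="left", validate="many_to_one")
    out = out.merge(opp, on="opponent", how="left", validate="many_to_one")
    return out

context = attach_team_context(pairings, team_stats)
```

Each row then carries `team_xg_for_per90`, `team_xg_against_per90`, `opp_xg_for_per90`, and `opp_xg_against_per90`, so the model can see both sides of the matchup.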

And for the individual perspective:

- percentage of minutes played this season - "min%" (is the player playing a lot?)
- expected stats per 90 (how effective is this player attacking-wise?)
- tackle + challenge + blocks, per 90 data (how effective is this player defensively?)
- yellow/red cards per 90 (these actions lose points, so we want to know about them)
- penalty share, a number between 0 and 1 (we want to know if a player is their team's penalty kick taker, as this is a good way to get points)
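The penalty share can be sketched as each player's penalty attempts divided by their team's total attempts. The column name `pens_attempted` and the sample rows are assumptions for illustration. Teams with no penalties yet get a share of 0 rather than a division by zero.

```python
import pandas as pd

def add_penalty_share(players):
    """Add a penalty_share column in [0, 1] from per-player penalty attempts."""
    # Total attempts for the player's team, repeated on every row of that team.
    team_total = players.groupby("team")["pens_attempted"].transform("sum")
    # Mask zero totals so the division yields NaN instead of dividing by zero.
    share = players["pens_attempted"].div(team_total.where(team_total > 0))
    return players.assign(penalty_share=share.fillna(0.0))

# Hypothetical input: team A has taken four penalties, team B none.
players = pd.DataFrame({
    "player_name": ["P1", "P2", "P3", "P4"],
    "team": ["A", "A", "B", "B"],
    "pens_attempted": [3, 1, 0, 0],
})
with_share = add_penalty_share(players)
```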

We will also get all the scheduling information out of this site. 
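To use the schedule per player, it helps to have one row per team per match rather than one row per match. A minimal sketch, assuming a fixtures frame with `home_team`, `away_team`, and `week` columns; the `team`, `opponent`, and `is_home` names are my own:

```python
import pandas as pd

def fixtures_to_team_rows(fixtures):
    """Turn one row per match into one row per team per match."""
    home = fixtures.rename(
        columns={"home_team": "team", "away_team": "opponent"}
    ).assign(is_home=True)
    away = fixtures.rename(
        columns={"away_team": "team", "home_team": "opponent"}
    ).assign(is_home=False)
    cols = ["week", "team", "opponent", "is_home"]
    # ignore_index drops whatever index the schedule came with.
    return pd.concat([home[cols], away[cols]], ignore_index=True)

# A single fixture, for illustration.
fixtures = pd.DataFrame({
    "home_team": ["West Ham"],
    "away_team": ["Leeds United"],
    "week": [38],
})
team_rows = fixtures_to_team_rows(fixtures)
```

This also yields the home-or-away flag directly, without a separate lookup.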

### Official fantasy premier league site

We also want some attributes that relate to the game itself. These include:

- price and selection %: these won't really assist in predicting points (or rather, we don't want to use them for that), but they will come in handy for later functionality built on the model, like picking differentials and building a squad
- form: very important. We want to know how this player is performing coming into the gameweek
- finally, actual points scored

Remember, these are all snapshot statistics - we want to know what these values were before the gameweek, and after the gameweek, we want to append the points scored to each record. 
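A sketch of what the pre-gameweek snapshot could look like, written against an already-downloaded JSON payload so that it involves no network call. The field names (`elements`, `teams`, `web_name`, `now_cost`, `selected_by_percent`, `form`) are what I believe the public `bootstrap-static` endpoint of the FPL site returns, but treat them as assumptions and check them against a real response. The sample payload is invented.

```python
import pandas as pd

def fpl_snapshot(payload):
    """Build a pre-gameweek snapshot with points_scored left blank."""
    # Map numeric team ids to team names (assumed payload layout).
    teams = {t["id"]: t["name"] for t in payload["teams"]}
    rows = [
        {
            "player_name": p["web_name"],
            "team": teams[p["team"]],
            "price": p["now_cost"] / 10,  # assumed to be stored in tenths of a million
            "selected_pct": float(p["selected_by_percent"]),
            "form": float(p["form"]),
            "points_scored": float("nan"),  # unknown until the gameweek is played
        }
        for p in payload["elements"]
    ]
    return pd.DataFrame(rows)

# Invented sample payload in the assumed shape.
sample_payload = {
    "teams": [{"id": 1, "name": "Arsenal"}],
    "elements": [
        {"web_name": "Saka", "team": 1, "now_cost": 105,
         "selected_by_percent": "45.2", "form": "6.5"},
    ],
}
snapshot = fpl_snapshot(sample_payload)
```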

### Proposed workflow

1) A script fills out the games to be played in the next gameweek. It creates a record for each player, with the gameweek, the individual's team, and the opposition.

2) We then access the Fbref data source in order to get team and opposition data. Basically, we will match on the player's team first, getting expected data both for and against - then we repeat the process for the opposition.

3) Now we have the player, who they are playing, and data about how their team and their opposition are performing per 90 up to this point in the season. Next, attach all the player-level per 90 data to each row. This should be straightforward except for the penalty kick share, which requires a small calculation: the percentage of the team's penalty kicks that the player has taken.

4) Now, join in the data from the official FPL website. Match based on player name, and grab price, % selection, form, and the column "points_scored" but leave this BLANK (we will not know it at the time this script runs).

5) We will let the game week happen, then run the script that gets player points for the week from the official FPL site. Join this in based on player name to the records we just created, using matchweek and player name as the combined key. 
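Step 5 can be sketched as a left merge on the combined key, so that every snapshot row is kept even when no points are found for it. The column names `week` and `player_name` are assumptions, and the sample rows are invented.

```python
import pandas as pd

def append_points(snapshot, points):
    """Fill points_scored on snapshot rows, matching on (week, player_name)."""
    keys = ["week", "player_name"]
    # Drop the blank placeholder column so the merge does not create suffixes.
    base = snapshot.drop(columns="points_scored", errors="ignore")
    return base.merge(
        points[keys + ["points_scored"]],
        on=keys,
        how="left",               # keep every snapshot row
        validate="one_to_one",    # fail loudly if a key is duplicated
    )

# Invented snapshot taken before the gameweek, and points collected after it.
snapshot = pd.DataFrame({
    "week": [12, 12],
    "player_name": ["Saka", "Haaland"],
    "form": [6.5, 8.0],
    "points_scored": [float("nan"), float("nan")],
})
points = pd.DataFrame({
    "week": [12],
    "player_name": ["Saka"],
    "points_scored": [9],
})
completed = append_points(snapshot, points)
```

The `validate="one_to_one"` check raises an error if two players share a name within the same week, which is a possible weakness of keying on names. A stable player ID, if one is available, would be a safer key.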

In [27]:
def get_fixtures(week_wanted):
    """
    Grabs the list of games for the given week and keeps only the cleaned home and away team names, plus the match week.
    """
    fbref = sd.FBref(leagues='ENG-Premier League', seasons='2025-2026')
    schedule = fbref.read_schedule()
    schedule['date'] = pd.to_datetime(schedule['date'], errors='coerce')
    schedule = schedule[schedule['week'] == week_wanted]

    return schedule[['home_team','away_team','week']]

def get_players():
    """
    grabs list of all valid FPL players, and who they play for. Also grabs their current individual statistics up to this point in time
    """
    return None

def get_teams():
    """
    grabs team statistics at this point in time, for each team
    """
    return None

In [None]:
fixtures = get_fixtures(12)

In [19]:
schedule = schedule.sort_values(by='date',ascending=False)
schedule.head()

league,season,game,week,day,date,time,home_team,home_xg,score,away_xg,away_team,attendance,venue,referee,match_report,notes,game_id
ENG-Premier League,2526,2026-05-24 West Ham-Leeds United,38,Sun,2026-05-24,16:00,West Ham,,,,Leeds United,,London Stadium,,,,
ENG-Premier League,2526,2026-05-24 Nott'ham Forest-Bournemouth,38,Sun,2026-05-24,16:00,Nott'ham Forest,,,,Bournemouth,,The City Ground,,,,
ENG-Premier League,2526,2026-05-24 Manchester City-Aston Villa,38,Sun,2026-05-24,16:00,Manchester City,,,,Aston Villa,,Etihad Stadium,,,,
ENG-Premier League,2526,2026-05-24 Liverpool-Brentford,38,Sun,2026-05-24,16:00,Liverpool,,,,Brentford,,Anfield,,,,
ENG-Premier League,2526,2026-05-24 Fulham-Newcastle Utd,38,Sun,2026-05-24,16:00,Fulham,,,,Newcastle Utd,,Craven Cottage,,,,


In [20]:
# second, get team and opposition per 90 statistics

In [21]:
# third, get all player per 90 data

In [22]:
# fourth, join in FPL website data, blank values for gameweek_points_scored

In [None]:
# fifth, run script to go grab the gameweek we last grabbed, and match in the points they scored