# EPL Fantasy 6-Week Predictor Model
A few notes on the model, 
- I assume, across seasons, game weeks X through Y are comparable. 2020 will be a special case, but for the purpose of this excercise I run the same assumption.
- I went with the kitchen sink approach here in choosing the random forest model, I've done model evaluation in the past and found there wasn't a notable difference in methods for this excercise
- I consider a +/- difference of 4 or less a fairly solid prediction. Obviously the total can add up if each of your players end up being off by 4 or more, but in most gameweeks I either beat or tie the average using the predictions here.

In [52]:
import pandas as pd
import numpy as np
import requests
import pulp
from IPython.display import display, Markdown, Latex
display(Markdown('*some markdown* $\phi$'))

*some markdown* $\phi$

## Training the model

In [53]:
df_import = pd.read_csv("https://raw.githubusercontent.com/tprice90/epl_fantasy_predictor/main/Data%20Prep/16-20%20Combined.csv")
df_import = pd.DataFrame(df_import)
df_import = df_import.drop(columns=['Unnamed: 0'], axis=0)
df = df_import.drop(columns=['name','element','id'], axis=0)
df.head()


Unnamed: 0,assists,bonus,bps,clean_sheets,goals_conceded,goals_scored,own_goals,penalties_missed,penalties_saved,red_cards,...,influence,threat,value,n6_total_points,n6_opponent_difficulty,perc_min_played,element_type_1,element_type_2,element_type_3,element_type_4
0,0,1,102,2,6,0,0,0,0,0,...,7.2,8.0,53,12,2.5,1.0,0,1,0,0
1,1,0,62,2,2,0,0,0,0,0,...,0.0,18.0,55,1,2.166667,0.577778,0,0,1,0
2,0,0,26,0,3,0,0,0,0,0,...,7.2,37.0,76,6,2.4,0.233333,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,45,0,2.166667,0.0,0,0,1,0
4,0,0,18,0,6,0,0,0,0,0,...,3.6,18.0,46,7,2.833333,0.333333,0,0,1,0


In [54]:
# Use numpy to convert to arrays

# Labels are the values we want to predict
labels = np.array(df['n6_total_points'])
# Remove the labels from the features
# axis 1 refers to the columns
features = df.drop('n6_total_points', axis = 1)
# Saving feature names for later use
features_list = list(features.columns)
# Convert to numpy array
features = np.array(features)

In [55]:
# Using Skicit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)

In [56]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (1770, 28)
Training Labels Shape: (1770,)
Testing Features Shape: (591, 28)
Testing Labels Shape: (591,)


In [57]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);

In [58]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'points')

Mean Absolute Error: 5.12 points


In [59]:
rf.fit(features, labels)

RandomForestRegressor(n_estimators=1000, random_state=42)

In [60]:
predictions = rf.predict(features)

In [61]:
df_import['predictions'] = predictions.round()
df_import.head()

Unnamed: 0,name,assists,bonus,bps,clean_sheets,goals_conceded,goals_scored,own_goals,penalties_missed,penalties_saved,...,value,n6_total_points,n6_opponent_difficulty,perc_min_played,id,element_type_1,element_type_2,element_type_3,element_type_4,predictions
0,Aaron_Cresswell,0,1,102,2,6,0,0,0,0,...,53,12,2.5,1.0,454.0,0,1,0,0,13.0
1,Aaron_Lennon,1,0,62,2,2,0,0,0,0,...,55,1,2.166667,0.577778,142.0,0,0,1,0,5.0
2,Aaron_Ramsey,0,0,26,0,3,0,0,0,0,...,76,6,2.4,0.233333,16.0,0,0,1,0,8.0
3,Aaron_Wan-Bissaka,0,0,0,0,0,0,0,0,0,...,45,0,2.166667,0.0,612.0,0,0,1,0,0.0
4,Abdoulaye_Doucouré,0,0,18,0,6,0,0,0,0,...,46,7,2.833333,0.333333,482.0,0,0,1,0,7.0


## Pulling most recent gameweek data and creating predictions

In [68]:
##First 6 GW of 2019-20
url = ["https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2020-21/gws/gw15.csv",
      "https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2020-21/gws/gw16.csv",
      "https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2020-21/gws/gw17.csv",
      "https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2020-21/gws/gw18.csv",
      "https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2020-21/gws/gw19.csv",
      "https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2020-21/gws/gw20.csv"]

In [69]:
df_n6 = pd.read_csv("https://raw.githubusercontent.com/tprice90/epl_fantasy_predictor/main/Data%20Prep/upcoming_six_games.csv")
df_n6 = pd.DataFrame(df_n6)

df_n6.head(20)

Unnamed: 0,team,n6_opponent_difficulty,team_code
0,Arsenal,3.333333,1
1,Aston Villa,3.0,2
2,Brighton,3.166667,3
3,Burnley,3.333333,4
4,Chelsea,3.333333,5
5,Crystal Palace,2.5,6
6,Everton,3.166667,7
7,Fulham,3.0,8
8,Leicester,3.166667,9
9,Leeds,3.0,10


In [70]:
fix_diff = pd.read_csv("https://raw.githubusercontent.com/tprice90/epl_fantasy_predictor/main/Data%20Prep/epl_difficulty_20_21.csv")
fix_diff = pd.DataFrame(fix_diff)

df_20_21 = pd.DataFrame()
for data in url:
    df = pd.read_csv(data, encoding = "ISO-8859-1")
    df_20_21 = df_20_21.append(df)
    
df_20_21 = df_20_21.merge(fix_diff, left_on='opponent_team', right_on='opponent_team')
df_20_21_agg = df_20_21.groupby(['name'], as_index=False).agg({'assists':'sum',
                                                             'bonus':'sum',
                                                             'bps':'sum',
                                                             'clean_sheets':'sum',                                                             
                                                              'goals_conceded':'sum',
                                                              'goals_scored':'sum',
                                                              'minutes':'sum',
                                                              'own_goals':'sum',
                                                              'penalties_missed':'sum',
                                                              'penalties_saved':'sum',
                                                              'red_cards':'sum',
                                                              'saves':'sum',
                                                              'selected':'sum',
                                                              'total_points':'sum',
                                                              'transfers_balance':'sum',
                                                              'was_home':'sum',
                                                              'yellow_cards':'sum',
                                                              'opponent_difficulty':'mean',
                                                              'round':'count'})

df_20_21_agg['round'] = max(df_20_21_agg['round'])

##Get Last Value for players in GW range
df_20_21_last = df_20_21.groupby('name',as_index=False).last()
df_20_21_last = df_20_21_last[['name','element','position','team','creativity','ict_index','influence','threat','value']]

df_20_21 = df_20_21_agg.merge(df_20_21_last, left_on='name', right_on='name')
df_20_21['perc_min_played'] = df_20_21['minutes']/(df_20_21['round']*90)
df_20_21 = df_20_21.drop(columns=['minutes','round'], axis=0)
df_20_21 = df_20_21.merge(df_n6, left_on='team', right_on='team')
df_20_21.head()
df_20_21 = df_20_21.drop('team', axis=1)

df_20_21['element_type_1'] = np.where(df_20_21['position']== 'GK', 1, 0)
df_20_21['element_type_2'] = np.where(df_20_21['position']== 'DEF', 1, 0)
df_20_21['element_type_3'] = np.where(df_20_21['position']== 'MID', 1, 0)
df_20_21['element_type_4'] = np.where(df_20_21['position']== 'FWD', 1, 0)

df_20_21['element_type'] = 1
df_20_21.loc[df_20_21['position']== 'DEF', 'element_type'] = 2
df_20_21.loc[df_20_21['position']== 'MID', 'element_type'] = 3
df_20_21.loc[df_20_21['position']== 'FWD', 'element_type'] = 4
df_20_21.tail()

Unnamed: 0,name,assists,bonus,bps,clean_sheets,goals_conceded,goals_scored,own_goals,penalties_missed,penalties_saved,...,threat,value,perc_min_played,n6_opponent_difficulty,team_code,element_type_1,element_type_2,element_type_3,element_type_4,element_type
639,Serge Aurier,0,0,48,0,3,1,0,0,0,...,0.0,52,0.357143,3.166667,17,0,1,0,0,2
640,Sergio ReguilÃ³n,1,3,59,1,2,0,0,0,0,...,25.0,57,0.384127,3.166667,17,0,1,0,0,2
641,Steven Bergwijn,3,0,41,1,5,0,0,0,0,...,0.0,70,0.449206,3.166667,17,0,0,1,0,3
642,Tanguy Ndombele,0,5,104,2,5,2,0,0,0,...,7.0,59,0.644444,3.166667,17,0,0,1,0,3
643,Toby Alderweireld,0,1,31,1,0,1,0,0,0,...,0.0,54,0.142857,3.166667,17,0,1,0,0,2


In [71]:
df_pred = df_20_21.drop(['name','element','position','element_type','team_code'], axis=1)
predictions = rf.predict(df_pred)


## Linear Programming to Maximize our Fantasy Scores given a Budget Constraint

In [72]:
df_20_21['predictions'] = predictions.round()

#drop injured players showing up
# df_20_21 = df_20_21[df_20_21['name'] != 'Ãaglar SÃ¶yÃ¼ncÃ¼'] 
# df_20_21 = df_20_21[df_20_21['name'] != 'Kevin De Bruyne'] 


expected_scores = df_20_21['predictions']
prices = df_20_21['value']/10
positions = df_20_21['element_type']
clubs = df_20_21['team_code']
names = df_20_21['name']

In [73]:

def select_team(expected_scores, prices, positions, clubs, total_budget=99.8, sub_factor=0.2):
    num_players = len(expected_scores)
    model = pulp.LpProblem("Constrained value maximisation", pulp.LpMaximize)
    decisions = [
        pulp.LpVariable("x{}".format(i), lowBound=0, upBound=1, cat='Integer')
        for i in range(num_players)
    ]
    captain_decisions = [
        pulp.LpVariable("y{}".format(i), lowBound=0, upBound=1, cat='Integer')
        for i in range(num_players)
    ]
    sub_decisions = [
        pulp.LpVariable("z{}".format(i), lowBound=0, upBound=1, cat='Integer')
        for i in range(num_players)
    ]


    # objective function:
    model += sum((captain_decisions[i] + decisions[i] + sub_decisions[i]*sub_factor) * expected_scores[i]
                 for i in range(num_players)), "Objective"

    # cost constraint
    model += sum((decisions[i] + sub_decisions[i]) * prices[i] for i in range(num_players)) <= total_budget  # total cost

    # position constraints
    # 1 starting goalkeeper
    model += sum(decisions[i] for i in range(num_players) if positions[i] == 1) == 1
    # 2 total goalkeepers
    model += sum(decisions[i] + sub_decisions[i] for i in range(num_players) if positions[i] == 1) == 2

    # 3-5 starting defenders
    model += sum(decisions[i] for i in range(num_players) if positions[i] == 2) >= 3
    model += sum(decisions[i] for i in range(num_players) if positions[i] == 2) <= 5
    # 5 total defenders
    model += sum(decisions[i] + sub_decisions[i] for i in range(num_players) if positions[i] == 2) == 5

    # 3-5 starting midfielders
    model += sum(decisions[i] for i in range(num_players) if positions[i] == 3) >= 3
    model += sum(decisions[i] for i in range(num_players) if positions[i] == 3) <= 5
    # 5 total midfielders
    model += sum(decisions[i] + sub_decisions[i] for i in range(num_players) if positions[i] == 3) == 5

    # 1-3 starting attackers
    model += sum(decisions[i] for i in range(num_players) if positions[i] == 4) >= 1
    model += sum(decisions[i] for i in range(num_players) if positions[i] == 4) <= 3
    # 3 total attackers
    model += sum(decisions[i] + sub_decisions[i] for i in range(num_players) if positions[i] == 4) == 3

    # club constraint
    for club_id in np.unique(clubs):
        model += sum(decisions[i] + sub_decisions[i] for i in range(num_players) if clubs[i] == club_id) <= 3  # max 3 players

    model += sum(decisions) == 11  # total team size
    model += sum(captain_decisions) == 1  # 1 captain
    
    for i in range(num_players):  
        model += (decisions[i] - captain_decisions[i]) >= 0  # captain must also be on team
        model += (decisions[i] + sub_decisions[i]) <= 1  # subs must not be on team

    model.solve()
    print("Total expected score = {}".format(model.objective.value()))

    return decisions, captain_decisions, sub_decisions

### Optimal Team Choice for Next 6 Weeks given our Budget

In [74]:
decisions, captain_decisions, sub_decisions = select_team(expected_scores.values, prices.values, positions.values, clubs.values)
# print results
for i in range(df.shape[0]):
    if decisions[i].value() != 0:
        display(Markdown("**{}** Points = {}, Price = {}".format(names[i], expected_scores[i], prices[i])))
print()
print("Subs:")
# print results
for i in range(df.shape[0]):
    if sub_decisions[i].value() == 1:
        display(Markdown("**{}** Points = {}, Price = {}".format(names[i], expected_scores[i], prices[i])))

print()
print("Captain:")
# print results
for i in range(df.shape[0]):
    if captain_decisions[i].value() == 1:
        display(Markdown("**{}** Points = {}, Price = {}".format(names[i], expected_scores[i], prices[i])))



Total expected score = 362.99999999999994


**Michail Antonio** Points = 23.0, Price = 6.3

**Bruno Miguel Borges Fernandes** Points = 29.0, Price = 11.1

**Pedro Lomba Neto** Points = 25.0, Price = 6.0

**Willy Boly** Points = 22.0, Price = 5.4

**Ederson Santana de Moraes** Points = 35.0, Price = 6.0

**Ilkay GÃ¼ndogan** Points = 32.0, Price = 5.6

**Phil Foden** Points = 31.0, Price = 6.3

**Andrew Robertson** Points = 26.0, Price = 7.3

**Trent Alexander-Arnold** Points = 28.0, Price = 7.2

**Ollie Watkins** Points = 26.0, Price = 6.1

**Pierre-Emerick Aubameyang** Points = 34.0, Price = 11.4


Subs:


**Rodrigo Moreno** Points = 21.0, Price = 5.7

**Bernd Leno** Points = 22.0, Price = 4.9

**Kieran Tierney** Points = 22.0, Price = 5.3

**Kyle Walker-Peters** Points = 20.0, Price = 4.8


Captain:


**Ederson Santana de Moraes** Points = 35.0, Price = 6.0