# DS3000 Project
## Fantasy Premier League Soccer Predictor
### Joey Sola-Sole, Harsh Sethia, Sarah Casale

# 1.1 Problem Statement

The purpose of this project is to enable Fantasy Premier League players to better analyze the performance of players and therefore make more informed decisions about which players they should buy, hold, or sell throughout the course of the season.

During each matchweek, each participant fields a squad of players that they feel will bring in the most points during their matches. The score of each player depends on both their performance and their position. The following is the breakdown of the scoring from the Fantasy Premier League website:

- All players receive points for the following:
    - (1) Point for playing up to 60 minutes.
    - (2) Points for playing 60 minutes or more (excluding stoppage time).
    - (3) Points for each goal assist.
    - (-2) Points for each penalty miss.
    - (1-3) Bonus Points for the best rated players in the match.
    - (-1) Points for each yellow card received.
    - (-3) Points for each red card received.
    - (-2) Points for each own goal scored.
- Goalkeepers receive points for the following:
    - (6) Points for scoring a goal.
    - (4) Points for a clean sheet.
    - (1) Points for every 3 shots saved.
    - (5) Points for each penalty save.
    - (-1) Points for every 2 goals conceded.
- Defenders receive points for the following:
    - (6) Points for scoring a goal.
    - (4) Points for a clean sheet.
    - (-1) Points for every 2 goals conceded.
- Midfielders receive points for the following:
    - (5) Points for scoring a goal.
    - (1) Points for a clean sheet.
- Forwards receive points for the following:
    - (4) Points for scoring a goal.
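
The scoring rules listed above can be sketched as a small function. This is only an illustration built from that list: the function name, argument names, and dictionary layout are assumptions and are not part of the project code, and the clean sheet is passed in as a flag rather than derived from match data.

```python
# Hypothetical helper: names and structure are illustrative assumptions,
# based only on the scoring list above.

# Points per goal and per clean sheet, by position.
GOAL_POINTS = {"GK": 6, "DEF": 6, "MID": 5, "FWD": 4}
CLEAN_SHEET_POINTS = {"GK": 4, "DEF": 4, "MID": 1, "FWD": 0}


def fantasy_points(position, minutes=0, goals=0, assists=0, clean_sheet=False,
                   goals_conceded=0, saves=0, penalty_saves=0, penalty_misses=0,
                   yellow_cards=0, red_cards=0, own_goals=0, bonus=0):
    """Return the fantasy points for one player in one match."""
    points = 0

    # Appearance points: 1 for playing under 60 minutes, 2 for 60 or more.
    if minutes >= 60:
        points += 2
    elif minutes > 0:
        points += 1

    # Rules that depend on position.
    points += GOAL_POINTS[position] * goals
    if clean_sheet:
        points += CLEAN_SHEET_POINTS[position]
    if position == "GK":
        points += saves // 3          # 1 point for every 3 saves
        points += 5 * penalty_saves
    if position in ("GK", "DEF"):
        points -= goals_conceded // 2  # -1 point for every 2 goals conceded

    # Rules shared by all positions.
    points += 3 * assists
    points -= 2 * penalty_misses
    points -= yellow_cards
    points -= 3 * red_cards
    points -= 2 * own_goals
    points += bonus                   # 1-3 bonus points, decided per match
    return points


# A midfielder who plays 90 minutes, scores once, assists once, and is booked.
print(fantasy_points("MID", minutes=90, goals=1, assists=1, yellow_cards=1))  # 9
```

The same goal is worth 6 points for a defender and 4 for a forward, which is why the position of a player matters when comparing raw statistics.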

Our project identifies the best potential players for each matchweek by analyzing their position and their previous performance record.

**1.2 Significance of the Problem**

Participants in the Fantasy Premier League must field a team against different players every week in the hope of winning a grand prize at the end of the season. Each matchweek, Fantasy players spend a lot of time trying to find and analyze statistics in order to make an educated decision as to which players on their squad will perform the best.

This project not only provides players access to better statistics, it also allows them to efficiently analyze the data that they're given with a predicted performance score that will allow them to make faster decisions. It also allows them to compare various players against one another, as well as taking a player's opponent difficulty into account, in order to make a long-term strategy for their squad.

**1.3 Questions and Hypothesis**

Before we began writing the code for our project, we first wanted to learn more about the following questions and how they would affect our predictions:

1. Does the position of the player affect how many points they receive?
    - Null Hypothesis: The position of the player doesn't affect how many points they are able to receive in a given matchweek.

    - Alternative Hypothesis: The position of the player does affect how many points they are able to receive in a given matchweek.
2. Does the difficulty level of the player's opponent team that week affect the number of points they are able to achieve?
    - Null Hypothesis: The difficulty level of the player's opponent team that week does not affect the number of points they are able to achieve.

    - Alternative Hypothesis: The difficulty level of the player's opponent team that week affects the number of points they are able to achieve.
3. Does performance outside of league play (International Competitions / Club Competitions) affect the player's value?
    - Null Hypothesis: The player's performance outside of league play does not affect the player's value.

    - Alternative Hypothesis: The player's performance outside of league play does affect the player's value.
4. Which of the machine learning algorithms that we used produces the most accurate predictions?

# 2. Method

**2.1 Data Acquisition**

- Our data was obtained from a GitHub user who scraped the Fantasy Premier League website for the 2016-17 through 2020-21 seasons to get data on players for every game in that time.

  - The data we used is a compilation of hundreds of files made by this user, combined into one dataframe:
https://github.com/vaastav/Fantasy-Premier-League

  - We ran a script that visits every file from the repository, reads the data, and compiles it into one large dataframe.

- Once put together, the dataset is 114803 rows × 31 columns.
  - Each row represents a player's performance in one game. There are 5 seasons in the dataset and 38 games in each season, which means there are about 600 players per season.
  - Each column represents a statistic from one game, for example the number of goals, assists, or minutes played.

- Later, we add our own custom columns, which show trend and form.
  - The trend and form columns are the ones we will use to predict a player's score for an upcoming game.
  - Trend is the average change between consecutive games over the last 4 games.
For example, if a player scores 3 goals, then 2, then 0, then 2, their trend is ((2 - 3) + (0 - 2) + (2 - 0)) / 3 = -0.333.
  - Form is the average over the last 4 games.
With the same example, if a player scores 3 goals, then 2, then 0, then 2, their form is (3 + 2 + 0 + 2) / 4 = 1.75.
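
The two definitions can be checked with a few lines of plain Python. This is a minimal sketch using the goal counts from the example; the helper names `form` and `trend` are chosen here for illustration and are not the functions used later in the notebook.

```python
def form(last_four):
    """Average of the last four values, listed oldest first."""
    return sum(last_four) / len(last_four)


def trend(last_four):
    """Average of the three game-to-game differences, listed oldest first."""
    diffs = [later - earlier for earlier, later in zip(last_four, last_four[1:])]
    return sum(diffs) / len(diffs)


goals = [3, 2, 0, 2]
print(form(goals))             # 1.75
print(round(trend(goals), 3))  # -0.333
```

Because the intermediate games cancel out when the differences are summed, the trend is equivalent to (most recent game - oldest game) / 3, so it only reflects the first and last of the four games.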

**2.2 Data Analysis**
- With our model, we are trying to predict the total points that a player will score in the upcoming week based on past performances.

  - The model bases its predictions on all of the calculated trend and form columns, whether it's a home game, and the opponent.
  - Those are the most important because form is often a very good predictor of how a player will do, and home field advantage is very important in sports.
  - We also expect a strong opponent to lower a player's points compared to a weak opponent.
  - This is a supervised ML problem because we have labeled data.
  - It is regression, because we are trying to predict the number of points a player will score, not classify them into groups.

- After testing Linear Regression, Ridge, Lasso, KNeighborsRegressor, and LinearSVR, we determined that Linear Regression produced the best results. We also used train_test_split, MinMaxScaler, and RFE with a DecisionTreeRegressor. We chose these because each fit the type of model we wanted and improved our test R-squared scores over the alternatives.

# 3. Results

**3.1 Data Wrangling**

The player statistics that we used were scraped from https://github.com/vaastav/Fantasy-Premier-League, specifically from the players directory, which contains data for 500+ players for each season.

The data that we cleaned was based on the most important point-scoring statistics on the Fantasy Premier League website (https://fantasy.premierleague.com/help/rules) and what additional data we needed in the end. Descriptions of each variable are as follows: 
- Gameweek: The week of a season in which a game is played
- Assists: How many assists a player gave
- Big Chances Created: How many big chances a player created; more big chances indicate a better chance of assists and goals in the future
- Bonus: Additional points scored based on performance
- Clean Sheets: 1 if no goal was conceded, 0 otherwise
- Fouls: Fouls committed by a player in a game
- Goals Conceded: How many goals the opposition scored

Regarding the rubric:
- Our big dataframe didn't drop every NA value initially (to keep the NA records of players), but prior to analysis we used the dropna function to remove those values, and we divided the players into 4 categories based on position. Players with no position were not analyzed because less information was available for them.
- We applied several functions to clean up the data, remove columns, and add a position to every player record.
- Variables were preprocessed prior to analysis.
- Feature extraction took place based on the player position, since every position category's points depend on different features.

First we pull the data from the source.
We found it infeasible to pull directly from the source for this file because it consists of hundreds of CSVs spread across separately named folders, so we link to the original GitHub repository with all of the files that we iterated through: https://github.com/vaastav/Fantasy-Premier-League. Other files are pulled from the group's own GitHub repository listed below.

https://github.com/jsolasole/DS300-Project

All files produced after the first step are placed in that repository, and any later use of the dataframes reads them from the link.


1. Here we iterate through every player folder inside the folder named after each season.
2. After that, we add each player's gw.csv file path to a dictionary keyed by season.
3. Then we read each CSV as a dataframe into a new dictionary.

In [2]:
# import required module
import os
import pandas as pd

""" Read all necessary csv files into the dictionary from files on computer"""
years = {"2016-17": [], "2017-18": [], "2018-19": [], "2019-20": [], "2020-21": []}

#https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2020-21/gws/gw1.csv



for year, files_list in years.items():
    print(year)
    directory = f'Fantasy-Premier-League/data/{year}/players/'
   
    # player directory
    for player_dir in os.listdir(directory):
        player_dir = os.path.join(directory, player_dir)

        if os.path.isdir(player_dir):

            # file in player directory
            for file in os.listdir(player_dir):

                if os.path.join(player_dir, file) not in files_list:
                    file = os.path.join(player_dir, file)
                    
                    #if it is of the file type we want
                    if file[-6:] == "gw.csv":
                        files_list.append(file)
                        break


"""read all of the csv files one by one and add to the df"""
player_stats_by_year = {"2016-17": {}, "2017-18": {}, "2018-19": {}, "2019-20": {}, "2020-21": {}}

# for each season in the dictionary
for year in player_stats_by_year:
    season_frames = []

    # for every csv file in each year
    for csv in years[year]:

        # get the player name (the folder that holds the file) and the dataframe
        player = os.path.basename(os.path.dirname(csv))
        df = pd.read_csv(csv)

        # reformat the player dataframe
        df.reset_index(inplace = True)
        df = df.rename(columns={"index":"gameweek"})
        df["player"] = player
        df.set_index("player", inplace = True)

        season_frames.append(df)

    # DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat
    player_stats_by_year[year] = pd.concat(season_frames) if season_frames else pd.DataFrame()

2016-17
2017-18
2018-19
2019-20
2020-21
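
The file discovery loop above can also be sketched with `pathlib`. This is an alternative under stated assumptions, not the code that produced the dataframes: the helper names `find_gw_files` and `player_name` are hypothetical, and the pattern matches files named exactly `gw.csv`, whereas the loop above accepts any file name ending in `gw.csv`.

```python
from pathlib import Path


def find_gw_files(data_root, season):
    """Return every <player>/gw.csv path under <data_root>/<season>/players."""
    players_dir = Path(data_root) / season / "players"
    # One level of player folders, each expected to hold a gw.csv file.
    return sorted(players_dir.glob("*/gw.csv"))


def player_name(csv_path):
    """The player folder name is the parent directory of gw.csv."""
    return Path(csv_path).parent.name
```

With a local clone of the repository, `find_gw_files("Fantasy-Premier-League/data", "2016-17")` would return one path per player folder, and `player_name` recovers the folder name without depending on the path separator.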


This dataframe holds the position of each player, because the positions were not included in the original dataframe.
- Data is pulled from the online repository

In [3]:
directory = 'https://raw.githubusercontent.com/jsolasole/DS300-Project/main/all_pos.csv'

# player directory
pos_df = pd.read_csv(directory)
pos_df

Unnamed: 0,player,position
0,Clinton N'Jie,Unknown
1,Lewis Grabban,Unknown
2,Jose Luis Mato Sanmartn,Unknown
3,Isaac Success,Unknown
4,Matthew Pennington,Unknown
...,...,...
1510,Harry Boyes,Unknown
1511,James Bree,Unknown
1512,Felix Nmecha,Unknown
1513,Marek Rodk,Unknown


In [14]:
# create a list of the columns we want and df's for each season
# here we read in the dataframes that we made in the first step, which could not be pulled from online originally

all_cols = ['player','gameweek', 'assists', 'big_chances_created', 'bonus',
       'clean_sheets', 'element', 'fixture', 'fouls', 'goals_conceded',
       'goals_scored', 'minutes', 'opponent_team', 'own_goals',
       'penalties_conceded', 'penalties_missed', 'penalties_saved',
       'red_cards', 'saves', 'selected', 'team_a_score', 'team_h_score',
       'total_points', 'transfers_balance', 'value', 'was_home',
       'yellow_cards']

df = pd.read_csv('https://raw.githubusercontent.com/jsolasole/DS300-Project/main/2016-17.csv')
df2 = pd.read_csv('https://raw.githubusercontent.com/jsolasole/DS300-Project/main/2017-18.csv')
df3 = pd.read_csv('https://raw.githubusercontent.com/jsolasole/DS300-Project/main/2018-19.csv')
df4 = pd.read_csv('https://raw.githubusercontent.com/jsolasole/DS300-Project/main/2019-20.csv')
df5 = pd.read_csv('https://raw.githubusercontent.com/jsolasole/DS300-Project/main/2020-21.csv')

df

Unnamed: 0,player,gameweek,assists,big_chances_created,bonus,clean_sheets,element,fixture,fouls,goals_conceded,...,red_cards,saves,selected,team_a_score,team_h_score,total_points,transfers_balance,value,was_home,yellow_cards
0,Clinton_N'Jie,0,0,0,0,0,404,3,0,0,...,0,0,1844,1,1,0,0,60,False,0
1,Clinton_N'Jie,1,0,0,0,0,404,16,0,0,...,0,0,1967,0,1,0,-323,60,True,0
2,Clinton_N'Jie,2,0,0,0,0,404,27,0,0,...,0,0,1849,1,1,0,-299,59,True,0
3,Clinton_N'Jie,3,0,0,0,0,404,37,0,0,...,0,0,1564,4,0,0,-348,59,False,0
4,Clinton_N'Jie,4,0,0,0,0,404,49,0,0,...,0,0,1431,0,1,0,-150,59,True,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23674,Roberto_Firmino,33,0,0,0,0,209,337,1,2,...,0,0,602650,2,1,2,50419,86,True,0
23675,Roberto_Firmino,34,0,1,0,1,209,349,1,0,...,0,0,604262,1,0,3,635,87,False,0
23676,Roberto_Firmino,35,0,0,0,1,209,357,2,0,...,0,0,571569,0,0,3,-33359,87,True,0
23677,Roberto_Firmino,36,0,0,0,0,209,370,0,0,...,0,0,474642,4,0,0,-97334,86,False,0


Here we clean the data and combine it into one final dataframe.
- A season column is added to each dataframe before it is added to the final df.
- The final two seasons were missing fouls, penalties conceded, and big chances created, so those columns were filled in with 0's.

In [17]:
#combine all the files into one big dataframe

#2017
df = df[all_cols]
#df.assign(season=2017)
df['season'] = 2017

#2018
df2 = df2[all_cols]
df2['season'] = 2018

#2019
df3 = df3[all_cols]
df3['season'] = 2019

#2020
#3 cols none because the values weren't there
df4['fouls'] = [0] * len(df4)
df4['penalties_conceded'] = [0] * len(df4)
df4['big_chances_created'] = [0] * len(df4)
df4 = df4[all_cols]
df4['season'] = 2020

#2021
#3 cols none because the values weren't there
df5['fouls'] = [0] * len(df5)
df5['penalties_conceded'] = [0] * len(df5)
df5['big_chances_created'] = [0] * len(df5)
df5 = df5[all_cols]
df5['season'] = 2021

# put all the data into one final dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_together = pd.concat([df, df2, df3, df4, df5])
df_together.set_index("player",inplace=True)
df_together

Unnamed: 0_level_0,gameweek,assists,big_chances_created,bonus,clean_sheets,element,fixture,fouls,goals_conceded,goals_scored,...,saves,selected,team_a_score,team_h_score,total_points,transfers_balance,value,was_home,yellow_cards,season
player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Clinton_N'Jie,0,0,0,0,0,404,3,0,0,0,...,0,1844,1,1,0,0,60,False,0,2017
Clinton_N'Jie,1,0,0,0,0,404,16,0,0,0,...,0,1967,0,1,0,-323,60,True,0,2017
Clinton_N'Jie,2,0,0,0,0,404,27,0,0,0,...,0,1849,1,1,0,-299,59,True,0,2017
Clinton_N'Jie,3,0,0,0,0,404,37,0,0,0,...,0,1564,4,0,0,-348,59,False,0,2017
Clinton_N'Jie,4,0,0,0,0,404,49,0,0,0,...,0,1431,0,1,0,-150,59,True,0,2017
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ryan_Fredericks_438,33,0,0,0,0,438,330,0,0,0,...,0,15868,2,1,0,265,42,False,0,2021
Ryan_Fredericks_438,34,0,0,0,0,438,347,0,0,0,...,0,15806,1,0,1,-71,42,True,0,2021
Ryan_Fredericks_438,35,0,0,0,0,438,349,0,0,0,...,0,15946,1,1,0,93,42,False,0,2021
Ryan_Fredericks_438,36,0,0,0,0,438,368,0,0,0,...,0,15877,3,1,0,-61,42,False,0,2021


In [18]:
def clean_name(name):
    """cleans the names of players, removing unnecessary characters
    helpful so that we can combine position and together_df"""
    new_name = ''
    alphabet = [' ','-', "'", '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
    
    for char in name:
        if char.lower() in alphabet:
            new_name += char
    
    #removes the last character if it's an underscore
    #(endswith also handles an empty string, where new_name[-1] would raise an IndexError)
    if new_name.endswith('_'):
        new_name = new_name[:-1]
    
    #replaces all underscores with spaces
    new_name = new_name.replace("_", " ")
    
    return new_name

In [19]:
# Apply the player name function to the dataframe

df_together.reset_index(inplace=True)
df_together['player'] = df_together['player'].apply(clean_name)
df_together

Unnamed: 0,player,gameweek,assists,big_chances_created,bonus,clean_sheets,element,fixture,fouls,goals_conceded,...,saves,selected,team_a_score,team_h_score,total_points,transfers_balance,value,was_home,yellow_cards,season
0,Clinton N'Jie,0,0,0,0,0,404,3,0,0,...,0,1844,1,1,0,0,60,False,0,2017
1,Clinton N'Jie,1,0,0,0,0,404,16,0,0,...,0,1967,0,1,0,-323,60,True,0,2017
2,Clinton N'Jie,2,0,0,0,0,404,27,0,0,...,0,1849,1,1,0,-299,59,True,0,2017
3,Clinton N'Jie,3,0,0,0,0,404,37,0,0,...,0,1564,4,0,0,-348,59,False,0,2017
4,Clinton N'Jie,4,0,0,0,0,404,49,0,0,...,0,1431,0,1,0,-150,59,True,0,2017
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114798,Ryan Fredericks,33,0,0,0,0,438,330,0,0,...,0,15868,2,1,0,265,42,False,0,2021
114799,Ryan Fredericks,34,0,0,0,0,438,347,0,0,...,0,15806,1,0,1,-71,42,True,0,2021
114800,Ryan Fredericks,35,0,0,0,0,438,349,0,0,...,0,15946,1,1,0,93,42,False,0,2021
114801,Ryan Fredericks,36,0,0,0,0,438,368,0,0,...,0,15877,3,1,0,-61,42,False,0,2021


Then we bring the original df and the positions df together to create one big dataframe used for calculations

In [20]:
#add the positions into our original dataframe

df_together['position'] = "Unknown"
#For all players in original dataframe
for i in range(len(df_together.index)):
  #search if the player is in the positions dataframe
  search = df_together.iloc[i]['player']
  #If yes, update position
  if search in pos_df['player'].tolist():
    num = pos_df[pos_df['player']==search].index.values
    num = num[0]
    df_together.at[i,'position'] = pos_df.iloc[num]['position']
    #else it stays unknown.
  else :
    df_together.at[i,'position'] = 'Unknown'
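
The row-by-row lookup above rebuilds the list of known players on every iteration, which is slow on more than 100,000 rows. A vectorized sketch of the same idea is shown below on toy data; the two small dataframes and their rows are made up for illustration and stand in for `df_together` and `pos_df`.

```python
import pandas as pd

# Toy stand-ins for df_together and pos_df; the rows are invented.
players = pd.DataFrame({"player": ["A Keeper", "B Striker", "C Nobody", "A Keeper"]})
positions = pd.DataFrame({"player": ["A Keeper", "B Striker"],
                          "position": ["GK", "FWD"]})

# Build a player -> position lookup. If a name repeats, keep the first entry,
# which mirrors taking the first matching index in the loop.
lookup = positions.drop_duplicates("player").set_index("player")["position"]

# Map every row at once and fall back to "Unknown" for unmatched names.
players["position"] = players["player"].map(lookup).fillna("Unknown")
print(players)
```

The result should match the loop whenever the positions table has no missing values, since both approaches take the first match for a name and label everything else as Unknown.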

In [21]:
#Create new blank form and trend columns
# form will be the average performance from the last four games in all the columns listed below
# trend will be the average difference between the last four gameweeks

df_together["minutes_form"] = [0] * len(df_together)
df_together["minutes_trend"] = [0] * len(df_together)
df_together["assists_form"] = [0] * len(df_together)
df_together["assists_trend"] = [0] * len(df_together)
df_together['big_chances_created_form'] = [0] * len(df_together)
df_together['big_chances_created_trend'] = [0] * len(df_together)
df_together['bonus_form'] = [0] * len(df_together)
df_together['bonus_trend'] = [0] * len(df_together)
df_together['clean_sheets_form'] = [0] * len(df_together)
df_together['clean_sheets_trend'] = [0] * len(df_together)
df_together['fouls_form'] = [0] * len(df_together)
df_together['fouls_trend'] = [0] * len(df_together)
df_together['goals_conceded_form'] = [0] * len(df_together)
df_together['goals_conceded_trend'] = [0] * len(df_together)
df_together['goals_scored_form'] = [0] * len(df_together)
df_together['goals_scored_trend'] = [0] * len(df_together)
df_together['saves_form'] = [0] * len(df_together)
df_together['saves_trend'] = [0] * len(df_together)
df_together['transfers_balance_form'] = [0] * len(df_together)
df_together['transfers_balance_trend'] = [0] * len(df_together)
df_together['yellow_cards_form'] = [0] * len(df_together)
df_together['yellow_cards_trend'] = [0] * len(df_together)
df_together['selected_trend'] = [0] * len(df_together)
df_together['selected_form'] = [0] * len(df_together)
df_together['value_trend'] = [0] * len(df_together)
df_together['value_form'] = [0] * len(df_together)
df_together.dropna(inplace=True)

The next functions are used to calculate the form and the trend of each of the most important stats.
- We used threading in order to make the process move faster.
- Prior to the use of threading, each function would take about 20 minutes to run.
- Now they take about 5 minutes each.

In [22]:
form_dict = {}
def form_calc(feature):
    """Function that takes in a feature and iterates through the entire df, 
    finding the form of it based on the last four gameweeks"""
    
    form_dict[feature] = []
    
    for i in range(len(df_together)):
        
        # skip the first four rows of each 38-row block (assumes 38 rows per player per season),
        # because they do not have four previous games to average
        if i % 38 >= 4:

            form = df_together.iloc[i-1][feature]
            form += df_together.iloc[i-2][feature]
            form += df_together.iloc[i-3][feature]
            form += df_together.iloc[i-4][feature]
            form /= 4

            #add the form to the list
            form_dict[feature].append(form)
        
        else:
            form_dict[feature].append(None)

In [23]:
trend_dict = {}

def trend_calc(feature):
    """Function that takes in a feature and iterates through the entire df, 
    finding the trend of it based on the last four gameweeks"""
    
    trend_dict[feature] = []
    for i in range(len(df_together)):
        
        # skip the first four rows of each 38-row block (assumes 38 rows per player per season),
        # because they do not have four previous games to compare
        if i % 38 >= 4:
            
            # find the average slope between most recent games to see if there has been improvement or worsening
            trend = df_together.iloc[i-1][feature] - df_together.iloc[i-2][feature]
            trend += df_together.iloc[i-2][feature] - df_together.iloc[i-3][feature]
            trend += df_together.iloc[i-3][feature] - df_together.iloc[i-4][feature]
            trend /= 3

            # add the trend to the list
            trend_dict[feature].append(trend)
        
        else:
            trend_dict[feature].append(None)
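
The two functions above index rows one at a time and rely on every player having exactly 38 rows per season. As a hedged alternative, the same form and trend values can be sketched with pandas group operations. The toy dataframe below is invented for illustration, and grouping by player and season is an assumption about how the rows should be partitioned.

```python
import pandas as pd

# Toy data: one hypothetical player-season with six gameweeks of goals.
toy = pd.DataFrame({
    "player": ["P"] * 6,
    "season": [2017] * 6,
    "goals_scored": [3, 2, 0, 2, 1, 4],
})

grouped = toy.groupby(["player", "season"])["goals_scored"]

# Form: mean of the previous four games (shift(1) excludes the current game).
toy["goals_scored_form"] = grouped.transform(lambda s: s.shift(1).rolling(4).mean())

# Trend: the three differences telescope to (most recent - oldest of the four) / 3.
toy["goals_scored_trend"] = grouped.transform(lambda s: (s.shift(1) - s.shift(4)) / 3)
print(toy)
```

The first four games of each group come out as NaN, which plays the same role as the `None` values appended above. For the fifth game the previous four are 3, 2, 0, 2, giving a form of 1.75 and a trend of -0.333.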

In [None]:
# TREND
# using threading so that the calculations for each feature occur 
# simultaneously and can run faster (because this is a lot of data)

import threading
threads = []

features = ['value','assists', 'big_chances_created', 'bonus',
        'clean_sheets', 'fouls', 'goals_conceded',
        'goals_scored', 'minutes', 'penalties_saved', 
        'saves', 'selected', 'transfers_balance','yellow_cards']

#iterate through features
for feature in features:
    print(feature)
    
    t = threading.Thread(target=trend_calc, args=(feature,))
    t.start()
    threads.append(t)

#join the threads at the end
for thread in threads:
    thread.join()
    
print('joined all')

#fill the column
for k,v in trend_dict.items():
    df_together[f'{k}_trend'] = v

value
assists
big_chances_created
bonus
clean_sheets
fouls
goals_conceded
goals_scored
minutes
penalties_saved
saves
selected
transfers_balance
yellow_cards


In [None]:
#FORM

threads = []

#iterate through features
for feature in features:
    print(feature)
    
    t = threading.Thread(target=form_calc, args=(feature,))
    t.start()
    threads.append(t)

#join the threads at the end
for thread in threads:
    thread.join()
print('joined all')

#fill the column
for k,v in form_dict.items():
    df_together[f'{k}_form'] = v

In [None]:
df_together

In [None]:
df_together.to_csv(f'final_df_together.csv', index=True)

---
# 3.3 Model Training
From here on, we show the training of our data and the methods we used to create the best possible model.

In [None]:
df_final = df_together[['player', 'gameweek', 'total_points', 'value',
       'position', 'minutes_form', 'minutes_trend',
       'assists_form', 'assists_trend', 'big_chances_created_form',
       'big_chances_created_trend', 'bonus_form', 'bonus_trend',
       'clean_sheets_form', 'clean_sheets_trend', 'fouls_form', 'fouls_trend',
       'goals_conceded_form', 'goals_conceded_trend', 'goals_scored_form',
       'goals_scored_trend', 'saves_form', 'saves_trend',
       'transfers_balance_form', 'transfers_balance_trend',
       'yellow_cards_form', 'yellow_cards_trend', 'selected_trend',
       'selected_form', 'penalties_saved_trend', 'penalties_saved_form', 
       'opponent_team', 'was_home', 'season']]

Create a dictionary with dataframes for each position

Split the data into different positions so that we can specialize four different models and store each in the dictionary

In [None]:
#create a dictionary with dataframes for each position

df_final_gk = df_final[df_final.position == 'GK']
df_final_mid = df_final[df_final.position == 'MID']
df_final_def = df_final[df_final.position == 'DEF']
df_final_att = df_final[df_final.position == 'FWD']

#In doing so we have also ignored players with unknown positions and unknown values for a lot of data.
df_dict = {'GOALKEEPERS': [df_final_gk], 'DEFENDERS': [df_final_def], 'MIDFIELDERS': [df_final_mid], 'FORWARDS': [df_final_att]}


Append feature and target df's to the dictionary for each position

In [None]:
features_list = ['minutes_form', 'minutes_trend',
       'assists_form', 'assists_trend', 'big_chances_created_form',
       'big_chances_created_trend', 'bonus_form', 'bonus_trend',
       'clean_sheets_form', 'clean_sheets_trend', 'fouls_form', 'fouls_trend',
       'goals_conceded_form', 'goals_conceded_trend', 'goals_scored_form',
       'goals_scored_trend', 'saves_form', 'saves_trend',
       'transfers_balance_form', 'transfers_balance_trend',
       'yellow_cards_form', 'yellow_cards_trend', 'selected_trend',
       'selected_form', 'penalties_saved_trend', 'penalties_saved_form', 
       'opponent_team', 'was_home']
target_list = ['total_points']

# split each df into a feature and a target and add them to the dictionary
# v[0]: original df
# v[1]: feature df
# v[2]: target df
# v[3]: model
# v[4]: selected features

for k,v in df_dict.items():
    # drop rows with missing values on a new frame, so the position slice is not modified in place
    df = v[0].dropna()
    v[0] = df
    v.append(df[features_list])
    v.append(df[target_list])
    

Now we test the 5 different possible regression models

1. We iterate through each position using all 5 models for each
2. Then we find the model which produces the best test results for each and continue with that one

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor

#create estimators dict to iterate through different models
estimators = {"Linear Regression": LinearRegression(), "Ridge": Ridge(), "Lasso": Lasso(), "kNN": KNeighborsRegressor(), "LinearSVR": LinearSVR()}

In [None]:
def regressors_percentage_split():
    """Iterates through the models for each position dataframe and prints results"""
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    
    for position, data in df_dict.items():
        print(position)

        #split data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(data[1], data[2], random_state=3000)

        #separate loop variable names so the estimator loop does not shadow the position loop
        for name, estimator in estimators.items():

            #select a regressor and create the model by fitting the training data
            model = estimator.fit(X=X_train, y=y_train)
            print('\t',name,": ")
            print("\t\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train)))
            print("\t\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test)))

In [None]:
regressors_percentage_split()

Next we scale our data for each of the different positions to see if we can achieve any improvement

In [None]:
def preprocessed_regression():
    """Does the same thing but with scaled data instead"""
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    
    for position, data in df_dict.items():
        print(position)
        X_train, X_test, y_train, y_test = train_test_split(data[1], data[2], random_state=3000)

        #create the scaler
        scaler = MinMaxScaler()

        #fit the scaler to the training data(features only)
        scaler.fit(X_train) 

        #transform X_train and X_test based on the (same) scaler
        #(done once per position, since the result does not depend on the estimator)
        X_train_scaled = scaler.transform(X_train) 
        X_test_scaled = scaler.transform(X_test) 

        for name, estimator in estimators.items():

            #select a regressor and create the model by fitting the training data
            model = estimator.fit(X=X_train_scaled, y=y_train)

            print('\t', name,": ")
            print("\t\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train_scaled)))
            print("\t\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test_scaled)))


In [None]:
preprocessed_regression()

Finally, with the scaled data and the chosen model (Linear Regression), we use RFE feature selection to determine, for each position, which features are most important.
- For each position, we try every number of selected features i from 1 to 26.
- If a value of i produces the best test score so far, we store it as the best, along with its associated test score, training score, model, and features list.

In [None]:
def RFE_feature_selection():
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.feature_selection import RFE
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import r2_score
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import train_test_split

    
    for k,v in df_dict.items():
        print(k)
        
        X_train, X_test, y_train, y_test = train_test_split(v[1], v[2], random_state=3000)

        #create the scaler
        scaler = MinMaxScaler()

        #fit the scaler to the training data(features only)
        scaler.fit(X_train) 

        #transform X_train and X_test based on the (same) scaler
        X_train_scaled = scaler.transform(X_train) 
        X_test_scaled = scaler.transform(X_test)
        
        best_i = 0
        best_score = 0
        best_train = 0
        features = []
        best_model = 0

        for i in range(26):
            
            select = RFE(DecisionTreeRegressor(random_state=3000), n_features_to_select=i+1)

            #fit the RFE selector to the training data
            select.fit(X_train, y_train)

            #transform training and testing sets so only the selected features are retained
            X_train_selected = select.transform(X_train_scaled)
            X_test_selected = select.transform(X_test_scaled)


            model = LinearRegression().fit(X=X_train_selected, y=y_train)

            test_score = r2_score(y_test, model.predict(X_test_selected))
            train_score = r2_score(y_train, model.predict(X_train_selected))
            
            # if the test performs better than the previous best, update all of the 'best' variables
            if test_score > best_score:
                print(i, ': ', test_score)
                best_score = test_score
                best_train = train_score
                best_i = i
                best_model = model

                # record the names of the features kept by this (best so far) selector
                features = [col for col, keep in zip(v[1].columns, select.get_support()) if keep]
        
        # store the best model and its features, then print the best results
        # (uses best_i and features, not the last selector tried in the loop)
        v.append(best_model)
        v.append(features)
        print(f"\tWith {best_i + 1} selected features:")
        print("\tSelected Features: ")
        for feature in features:
            print("\t\t", feature)
        print("\t\tR-squared value for training set: ", best_train)
        print("\t\tR-squared value for testing set: ", best_score)

- The output shows the progression of the best `i` value for each position, the selected features, and the R-squared values for the training and testing sets.
- The output shows no sign of overfitting or underfitting.
- Tuning the model helped us avoid overfitting and underfitting.
    - It also helped us refine our model to make its predictions more accurate.
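
The train/test comparison described above can be sketched in a few lines. This is only an illustration, not the code used in the project: `r_squared` re-implements the standard R-squared formula by hand, and `diagnose` with its `max_gap` and `min_score` thresholds is a hypothetical helper with made-up cut-offs. The sample values are invented.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def diagnose(train_r2, test_r2, max_gap=0.05, min_score=0.10):
    """Label a train/test R-squared pair (thresholds are illustrative assumptions)."""
    if train_r2 - test_r2 > max_gap:
        # the model fits the training data much better than unseen data
        return "possible overfitting"
    if train_r2 < min_score and test_r2 < min_score:
        # the model explains very little variance on either set
        return "possible underfitting"
    return "no obvious fit problem"


# invented points totals, for illustration only
y_true = [2, 6, 1, 8]
y_pred = [3, 5, 2, 7]
example_r2 = r_squared(y_true, y_pred)
example_label = diagnose(train_r2=0.30, test_r2=0.28)
```

A small gap between the two scores only rules out overfitting; both scores also need to be reasonably high before underfitting can be ruled out.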

In [None]:
RFE_feature_selection()

# 3.5 Model Testing
From here on, we have created a few functions that show the output of our predictions. The functions below output both dataframes and actual teams, which show the success of our model.

In [None]:
def predict_players(pos, season, gameweek):
    """function which takes in the position, season, and gameweek, and returns a dataframe 
    with the predicted and actual scores of each player in that position"""
    from sklearn.preprocessing import MinMaxScaler
    
    #filter the position's dataframe to the requested gameweek and season, keeping only the selected features
    features = df_dict[pos][4]
    test = df_dict[pos][0]
    test = test[test.gameweek==gameweek]
    test = test[test.season==season]
    original = test
    test = test[features]
    
    #scale the features for the model
    #NOTE: this scaler is fit on the rows of this gameweek only, not on the training data,
    #so the scaling may differ from what the model saw during training
    scaler = MinMaxScaler()
    scaler.fit(test) 

    #transform the selected features with that scaler
    scaled = scaler.transform(test) 

    model = df_dict[pos][3]
    
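    #each prediction is expected to be a one-element array (2-D target), so keep its single value
    predicted = []
    for val in model.predict(scaled):
        predicted.append(val.tolist()[0])
        
    
    #copy the slice so that adding a column does not trigger pandas' SettingWithCopyWarning
    original = original[['player','season','total_points']].copy()
    original['predicted'] = predicted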

    return original

In [None]:
def predict_gameweek(DEF, MID, FWD, season, gameweek):
    """function which takes in the formation (number of defenders, midfielders, and forwards),
    the season, and the gameweek, and prints out the predicted and actual scores of the
    top predicted players in each position; returns (predicted total, actual total)"""
    def_list = []
    mid_list = []
    fwd_list = []
    actual= 0
    predicted = 0
    gk = ''
    
    # Predicting values for Goalkeepers
    df = predict_players('GOALKEEPERS', season, gameweek)
    df = df.sort_values(by=['predicted'], ascending=False)
    gk = f'{df.player.iloc[0]}: [{df.total_points.iloc[0]}] {df.predicted.iloc[0]}'
    actual+=df.total_points.iloc[0]
    predicted+=df.predicted.iloc[0]
    # Predicting values for Defenders
    df = predict_players('DEFENDERS', season, gameweek)
    df = df.sort_values(by=['predicted'], ascending=False)
    for i in range(DEF):
        def_list.append(f'{df.player.iloc[i]}: [{df.total_points.iloc[i]}] {df.predicted.iloc[i]}')
        actual+=df.total_points.iloc[i]
        predicted+=df.predicted.iloc[i]
    # Predicting values for Midfielders
    df = predict_players('MIDFIELDERS', season, gameweek)
    df = df.sort_values(by=['predicted'], ascending=False)
    for i in range(MID):
        mid_list.append(f'{df.player.iloc[i]}: [{df.total_points.iloc[i]}] {df.predicted.iloc[i]}')
        actual+=df.total_points.iloc[i]
        predicted+=df.predicted.iloc[i]
    # Predicting values for Forwards
    df = predict_players('FORWARDS', season, gameweek)
    df = df.sort_values(by=['predicted'], ascending=False)
    for i in range(FWD):
        fwd_list.append(f'{df.player.iloc[i]}: [{df.total_points.iloc[i]}] {df.predicted.iloc[i]}')
        actual+=df.total_points.iloc[i]
        predicted+=df.predicted.iloc[i]
    
    print(f'FORMATION: {DEF}-{MID}-{FWD}')
    print('[]: actual score')
    
    print('\nGOALKEEPER')
    print(gk)
    
    print('\nDEFENSE')
    for defender in def_list:
        print(defender)
        
    print('\nMIDFIELD')
    for mid in mid_list:
        print(mid)
        
    print('\nATTACK')
    for fwd in fwd_list:
        print(fwd)
    
    print('\nFINAL CALCULATIONS')
    print(f'Predicted points: {predicted}')
    print(f'Actual points:    {actual}')
    print(f'Difference:       {actual-predicted}')
    
    return (predicted, actual)

In [None]:
def actual_gameweek(DEF, MID, FWD, season, gameweek):
    """function which takes in the formation (number of defenders, midfielders, and forwards),
    the season, and the gameweek, and prints out the actual scores of the top scoring
    players in each position; returns the actual total"""
    def_list = []
    mid_list = []
    fwd_list = []
    actual = 0
    gk = ''
    # Top actual scorer among Goalkeepers
    df = predict_players('GOALKEEPERS', season, gameweek)
    df = df.sort_values(by=['total_points'], ascending=False)
    gk = f'{df.player.iloc[0]}: {df.total_points.iloc[0]}'
    actual+=df.total_points.iloc[0]
    # Top actual scorers among Defenders
    df = predict_players('DEFENDERS', season, gameweek)
    df = df.sort_values(by=['total_points'], ascending=False)
    for i in range(DEF):
        def_list.append(f'{df.player.iloc[i]}: {df.total_points.iloc[i]}')
        actual+=df.total_points.iloc[i]
    # Top actual scorers among Midfielders
    df = predict_players('MIDFIELDERS', season, gameweek)
    df = df.sort_values(by=['total_points'], ascending=False)
    for i in range(MID):
        mid_list.append(f'{df.player.iloc[i]}: {df.total_points.iloc[i]}')
        actual+=df.total_points.iloc[i]
    # Top actual scorers among Forwards
    df = predict_players('FORWARDS', season, gameweek)
    df = df.sort_values(by=['total_points'], ascending=False)
    for i in range(FWD):
        fwd_list.append(f'{df.player.iloc[i]}: {df.total_points.iloc[i]}')
        actual+=df.total_points.iloc[i]
    
    print(f'FORMATION: {DEF}-{MID}-{FWD}')
    
    print('\nGOALKEEPER')
    print(gk)
    
    print('\nDEFENSE')
    for defender in def_list:
        print(defender)
        
    print('\nMIDFIELD')
    for mid in mid_list:
        print(mid)
        
    print('\nATTACK')
    for fwd in fwd_list:
        print(fwd)
        
    print(f'Total points: {actual}')
    return actual

# 3.2 
## Visualization 1

Using the predict_gameweek and actual_gameweek functions, we display the predicted and actual scores of the best team for a given gameweek. The visualization outputs the team in a format similar to how it would appear in a fantasy app. This would be a very useful visualization for anyone looking to use our model to predict their own fantasy team.

In [None]:
predict_gameweek(4,3,3, 2020, 32)
print('\n\n')
actual_gameweek(4,3,3, 2020, 32)

In [None]:
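def predictions_dataframe(DEF, MID, FWD, YEAR):
    """function which takes in the formation and the season, and returns a dataframe with
    one row per gameweek holding the predicted points, the actual points of the predicted
    team, and the points of the actual best team"""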
    output_dict = {'predicted': [], 'actual': [], 'best': []}

    for i in range(38):
        predicted, actual = predict_gameweek(DEF, MID, FWD, YEAR, i)
        best = actual_gameweek(DEF, MID, FWD, YEAR, i)
        output_dict['predicted'].append(predicted)
        output_dict['actual'].append(actual)
        output_dict['best'].append(best)

    df = pd.DataFrame.from_dict(output_dict)
    df.reset_index(inplace = True)
    df = df.rename(columns={'index':'gameweek'})
    return df

In [None]:
df = predictions_dataframe(4,3,3, 2021) 

In [None]:
df

## Visualization 2
This visualization shows the difference across all gameweeks in the 2021 season
* one line shows the predicted points of our predicted best team
* one line shows the actual points scored by our predicted best team
* the last line shows the points of the actual best team for that gameweek
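
The three series above can also be summarised numerically. The sketch below is one possible summary, not the metric used in this project: `capture_ratio` and `season_summary` are hypothetical helpers, and the two sample rows are invented. It assumes one record per gameweek with the same `predicted`, `actual`, and `best` keys as the dataframe columns.

```python
def capture_ratio(actual_points, best_points):
    """Share of the best possible team's points that the predicted team actually scored."""
    if best_points == 0:
        return 0.0
    return actual_points / best_points


def season_summary(rows):
    """Average capture ratio and average prediction error over all gameweeks.

    rows: list of dicts with 'predicted', 'actual', and 'best' keys, one per gameweek.
    """
    ratios = [capture_ratio(r["actual"], r["best"]) for r in rows]
    errors = [abs(r["predicted"] - r["actual"]) for r in rows]
    return {
        "mean_capture_ratio": sum(ratios) / len(ratios),
        "mean_abs_error": sum(errors) / len(errors),
    }


# invented values for two gameweeks, for illustration only
sample_rows = [
    {"predicted": 40.0, "actual": 30, "best": 120},
    {"predicted": 50.0, "actual": 45, "best": 90},
]
summary = season_summary(sample_rows)
```

With the dataframe from `predictions_dataframe`, the same records could be obtained with `df.to_dict('records')`.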

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
df_melted = pd.melt(df, id_vars=['gameweek'], value_vars=['predicted', 'actual', 'best'])


fig_dims = (12, 8)
fig, ax = plt.subplots(figsize=fig_dims)

#putting a df directly into seaborn
graph = sns.lineplot(x='gameweek', y='value', hue='variable', data=df_melted)

#Title
title = "Predicted and Actual Points of Predicted Best with Actual Best Team (2021)"
graph.set_title(title)

#labels
graph.set_xlabel("Gameweek", size = 16)
graph.set_ylabel("Team Points", size = 16)

## Visualization 3
This visualization shows the difference across all gameweeks in the 2020 season

In [None]:
df = predictions_dataframe(4,3,3, 2020)

In [None]:
df_melted = pd.melt(df, id_vars=['gameweek'], value_vars=['predicted', 'actual', 'best'])

fig_dims = (12, 8)
fig, ax = plt.subplots(figsize=fig_dims)

#putting a df directly into seaborn
graph = sns.lineplot(x='gameweek', y='value', hue='variable', data=df_melted)

#Title
title = "Predicted and Actual Points of Predicted Best with Actual Best Team (2020)"
graph.set_title(title)

#labels
graph.set_xlabel("Gameweek", size = 16)
graph.set_ylabel("Team Points", size = 16)

## Visualization 4
This visualization shows the difference across all gameweeks in the 2019 season

In [None]:
df = predictions_dataframe(4,3,3, 2019)

In [None]:
df_melted = pd.melt(df, id_vars=['gameweek'], value_vars=['predicted', 'actual', 'best'])

fig_dims = (12, 8)
fig, ax = plt.subplots(figsize=fig_dims)

#putting a df directly into seaborn
graph = sns.lineplot(x='gameweek', y='value', hue='variable', data=df_melted)

#Title
title = "Predicted and Actual Points of Predicted Best with Actual Best Team (2019)"
graph.set_title(title)

#labels
graph.set_xlabel("Gameweek", size = 16)
graph.set_ylabel("Team Points", size = 16)

## Visualization 5
This visualization shows the difference across all gameweeks in the 2018 season

In [None]:
df = predictions_dataframe(4,3,3, 2018)

In [None]:
df_melted = pd.melt(df, id_vars=['gameweek'], value_vars=['predicted', 'actual', 'best'])

fig_dims = (12, 8)
fig, ax = plt.subplots(figsize=fig_dims)

#putting a df directly into seaborn
graph = sns.lineplot(x='gameweek', y='value', hue='variable', data=df_melted)

#Title
title = "Predicted and Actual Points of Predicted Best with Actual Best Team (2018)"
graph.set_title(title)

#labels
graph.set_xlabel("Gameweek", size = 16)
graph.set_ylabel("Team Points", size = 16)

## Visualization 6
This visualization shows the difference across all gameweeks in the 2017 season

In [None]:
df = predictions_dataframe(4,3,3, 2017)

In [None]:
df_melted = pd.melt(df, id_vars=['gameweek'], value_vars=['predicted', 'actual', 'best'])

fig_dims = (12, 8)
fig, ax = plt.subplots(figsize=fig_dims)

#putting a df directly into seaborn
graph = sns.lineplot(x='gameweek', y='value', hue='variable', data=df_melted)

#Title
title = "Predicted and Actual Points of Predicted Best with Actual Best Team (2017)"
graph.set_title(title)

#labels
graph.set_xlabel("Gameweek", size = 16)
graph.set_ylabel("Team Points", size = 16)

# 4. Discussion

* For ML, we used four algorithms: Linear Regression, Ridge, KNN, and Lasso. 
* In all categories there was only a minor difference between Linear Regression and Ridge; however, Linear Regression had a slight edge as the best-performing algorithm. 


* Feature selection led to a small increase in performance (less than 1%) 
    - each position needed a different number of features to produce the best results
    - e.g. the goalkeeping model had 25 features selected, while the midfielders had 27 to produce the best results


* We believe that our model is a good predictor for our outcome variable of total points. As all of our group members know from personal experience, predicting who will do best in the unpredictable world of soccer is very difficult. Being able to pick the best team with an average accuracy of 25% is very impressive, and the model can certainly be used as a tool by any fantasy sports user.


* We all felt that the accuracy of the model was very low, partly due to missing data for many of the players. Furthermore, the sport is unpredictable because of injuries, game bans, and player fitness, so finding a way to incorporate the news could help us improve the accuracy. On top of that, player form naturally goes up and down: just because a player scores 3 goals in one game, that certainly doesn't mean they will score even one in the next. In the future, obtaining better data and a way to incorporate more external factors (data not taken within the game, e.g. injury record, team rotations, etc.) could help us understand and predict performance better. Teams will often announce before a game whether or not a player will be playing at all; this information would be key to improving our results. Game frequency also has a large impact on performance. 90 minutes of running is tiring for players, so teams with a week of rest will often outplay teams with 3 days of rest simply because of that factor. With so many variables in the world of sports, there are many other things to include that could improve our model.
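
Two of the ideas above, rest between games and recent form, could be turned into features along the following lines. This is a minimal sketch under stated assumptions and is not part of the current model: `add_rest_and_form` is a hypothetical helper, the `date` field is assumed (the dataset in this project may not contain match dates), and the sample matches are invented. Only earlier matches are used for each row, so the features do not leak the result being predicted.

```python
from datetime import date


def add_rest_and_form(matches, window=3):
    """Add 'rest_days' and 'form' to each match record.

    matches: list of dicts with 'player', 'date' (datetime.date), and 'total_points'.
    rest_days: days since the player's previous match (None for the first match).
    form: mean points over the player's previous `window` matches (None if no history).
    """
    history = {}  # player -> list of (date, points) for matches already processed
    enriched = []
    for match in sorted(matches, key=lambda m: (m["player"], m["date"])):
        past = history.setdefault(match["player"], [])
        rest_days = (match["date"] - past[-1][0]).days if past else None
        recent = [points for _, points in past[-window:]]
        form = sum(recent) / len(recent) if recent else None
        enriched.append({**match, "rest_days": rest_days, "form": form})
        # only now add the current match, so it never feeds its own features
        past.append((match["date"], match["total_points"]))
    return enriched


# invented matches for one hypothetical player
sample_matches = [
    {"player": "A", "date": date(2021, 1, 2), "total_points": 2},
    {"player": "A", "date": date(2021, 1, 5), "total_points": 9},
    {"player": "A", "date": date(2021, 1, 12), "total_points": 1},
]
rows = add_rest_and_form(sample_matches)
```

Whether such features actually improve the R-squared values would need to be tested with the same train/test procedure used for the existing features.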