# Team selection for English Premier League Fantasy Football

After signing up at https://fantasy.premierleague.com/ and going to "squad" selection, we see that we have to choose a total of 15 players (2 goalkeepers, 5 defenders, 5 midfielders and 3 forwards) to make up our squad. We will simplify the problem: choose the starting 11 (i.e. without subs) with the highest total score without exceeding our budget of 100 (million pounds), and just pick the cheapest player in each position as a sub.
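Written out, the simplified problem looks roughly like the following. The notation is my own and not from the game's site: $x_i$ is 1 if player $i$ starts and 0 otherwise, $s_i$ and $c_i$ are the player's total score and price, $B = 100$ is the budget, $P_p$ is the set of players in position $p$, $k_p$ is the number of starters in that position, and $m_p$ is the price of the cheapest player in that position (the sub).

$$
\max_{x \in \{0,1\}^n} \sum_{i=1}^{n} s_i x_i
\quad \text{subject to} \quad
\sum_{i=1}^{n} c_i x_i \le B - \sum_{p} m_p,
\qquad
\sum_{i \in P_p} x_i = k_p \ \text{ for each position } p
$$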

## Getting the player data from the website

First we save a bunch of html files. Under "Player selection", we can select which players to view, and we want to save the data one position at a time. So first we choose "Goalkeepers" and save the webpage as "GK1.html". Then, at the bottom of the "Player selection" table, we click on the arrow to go to the next page and save it as "GK2.html". Next we choose "Defenders" only and repeat the process (so we should have "D1.html" to "D6.html"). We repeat the process for the midfielders (giving us "MF1.html" to "MF8.html") and forwards (giving us "F1.html" to "F3.html"). Make sure that these files are saved in the same location as this notebook.

**Note:** It's important that we leave the "Sorted by" field as *Total Score* and "With a maximum price of" as *Unlimited*, because the code relies on all the players being listed in order of their total score. It is also important that your squad is empty. Otherwise some extra rows get added to the data, which messes up the scraping. I used Google Chrome but I think it should work with other browsers too.

After getting the data, we rename all ".html" files as ".txt" (e.g. "GK1.txt", "GK2.txt", "D1.txt" etc.). All the folders with the additional data created by your web browser when saving the webpage can be deleted.
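The renaming can also be done with a few lines of Python instead of by hand. This is only a sketch: the helper name `rename_html_to_txt` is made up for illustration, and it assumes every ".html" file in the folder is one of the saved pages.

```python
from pathlib import Path

def rename_html_to_txt(folder="."):
    """Rename every *.html file in `folder` to *.txt and return the new names."""
    renamed = []
    for html_file in sorted(Path(folder).glob("*.html")):
        txt_file = html_file.with_suffix(".txt")  # e.g. GK1.html -> GK1.txt
        html_file.rename(txt_file)
        renamed.append(txt_file.name)
    return renamed
```

Calling `rename_html_to_txt(".")` from the notebook's folder would rename the saved pages in place.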

## Choosing the "best" team
First we load all the required python libraries and define our constants.

In [47]:
### Libraries ###

import sys #used for input argument
import re #regular expressions
import csv #to write to csv
import numpy as np #to change list into matrix
import itertools #to iterate through different combinations of players

### Constants ###

POSITIONS=['GK','D','MF','F']
NOFILES=[2,6,8,3]
TOT_NUMBER=[1,4,4,2]
MAX_COST=100

Next, we take all the html data that we have (already saved as .txt files) and we convert them into matrices (1 for each position). So it looks like:

Goalies:

[[Player Status, Name, Price, Total Score, 'GK'],
[Player Status, Name, Price, Total Score, 'GK'],
.
.
.]

Defenders:

[[Player Status, Name, Price, Total Score, 'D'],
[Player Status, Name, Price, Total Score, 'D'],
.
.
.]

etc.

We save all these matrices into one dictionary called "tables" where each table has a key of its position (i.e. 'GK','D','MF','F').
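To make the layout concrete, here is a tiny stand-in for the dictionary with made-up players (the names, prices and scores are invented; the real rows come from the saved pages). Note that the scraped fields are strings and, in the code below, most of them still carry a trailing comma at this stage.

```python
import numpy as np

# Made-up rows for illustration only.
toy_goalies = np.array([
    ["View player information,", "Keeper A,", "5.5,", "120,", "GK"],
    ["View player information,", "Keeper B,", "5.0,", "110,", "GK"],
])
toy_tables = {"GK": toy_goalies}

names = toy_tables["GK"][:, 1]   # column 1 holds the names
prices = toy_tables["GK"][:, 2]  # column 2 holds the prices (still strings)
```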

In [48]:
### Convert from html to numpy matrices ###

# Function: removeChars
# Usage: number_str = removeChars(str)
# Description: takes in a string and deletes any character that's not a
# number or a decimal point.
######################################################################
def removeChars(string):
    non_decimal=re.compile(r'[^\d.]+')
    clean_string=non_decimal.sub('',string)
    return clean_string

# Function: makeRows
# Usage: statuses = makeRows(filename,str_bef,str_aft)
# Description: Finds all the strings written in between str_bef and
# str_aft and returns an array of those strings.
#######################################################################
def makeRows(filename,str_bef,str_aft):
    start=len(str_bef) #assumes str_bef sits at the very start of the line
    rows=[]
    with open(filename) as file: #the file is closed automatically when we're done
        for line in file:
            if str_bef in line:
                end=line.index(str_aft)
                row=line[start:end]+','
                row=row.replace("–", "-") #a hacky way to remove a non-recognized character by np.savetxt
                rows.append(row)
    return rows

# Function Family: find___
# Usage: ___ = find___(filename)
# Descriptions: Given a txt file of a team selection html page, it
# will find all the player ___ (e.g. statuses, names) for the week and
# make an array of them.
#######################################################################
def findStatuses(filename):
    str_bef='            <a href="https://fantasy.premierleague.com/a/squad/selection#" class="ismjs-info ism-table--el__status-link" title="'
    str_aft='"><svg class="ism-icon--element'
    statuses=makeRows(filename,str_bef,str_aft)
    return statuses
    
def findNames(filename):
    str_bef='                <a href="https://fantasy.premierleague.com/a/squad/selection#" class="ism-table--el__name">'
    str_aft='</a>'
    names=makeRows(filename,str_bef,str_aft)
    return names

def findPricesAndScores(filename):
    str_bef='    <td class="ism-table--el__strong">'
    str_aft='</td>'
    pricesAndScores=makeRows(filename,str_bef,str_aft)
    for i in range(0,len(pricesAndScores)):
        pricesAndScores[i]=removeChars(pricesAndScores[i]) #removes pound signs and other random entries that happened to have same bef and aft strings
    pricesAndScores=[x for x in pricesAndScores if x != ''] #remove blank entries (which show up for some reason)
    return pricesAndScores

# Function: onlyPrices[Scores](pricesAndScores)
# Usage: prices[scores]= onlyPrices[Scores](pricesAndScores)
# Descriptions: Takes in an array of pricesAndScores (where the prices
# and scores are back to back) and selects out only the price[score].
# This is needed because the price and score of a player have the same
# html tags before and after.
#######################################################################
def onlyPrices(pricesAndScores):
    prices=[]
    for i in range(0,len(pricesAndScores)):
        if i % 2 == 0:
            prices.append(pricesAndScores[i]+',')
    return prices

def onlyScores(pricesAndScores):
    scores=[]
    for i in range(0,len(pricesAndScores)):
        if i % 2 == 1:
            scores.append(pricesAndScores[i]+',')
    return scores

# Function: makeTable
# Usage: table = makeTable(filename,'GK')
# Descriptions: Takes in a html file (in txt format) and position and
# makes a table with columns 1) Player status, 2) Name, 3) Price, 4) 
# Total score and 5) Position.
#######################################################################
def makeTable(filename,position):
    table=[]
    #Find relevant fields and append them to the table
    statuses=findStatuses(filename)
    table.append(statuses)
    names=findNames(filename)
    table.append(names)
    pricesAndScores=findPricesAndScores(filename)
    prices=onlyPrices(pricesAndScores)
    table.append(prices)
    scores=onlyScores(pricesAndScores)
    table.append(scores)
    positions= [position]*len(statuses)
    table.append(positions)
    return table

# Function: rowBind
# Usage: full_table = rowBind(table1,table2)
# Descriptions: Takes 2 tables and combines them into 1 by row.
#######################################################################
def rowBind(table1,table2):
    if len(table1)==0:
        return table2
    for i in range(0,len(table1)):
        table1[i]=table1[i]+table2[i]
    return table1

## Main function ##
tables={}
#Make a table for each position and add it to tables
for i in range(0,len(POSITIONS)):
    #Initialize variables
    position = POSITIONS[i]
    noFiles= NOFILES[i]
    table=[]
    #Make a table for each file and then combine them
    for j in range(0,noFiles):
        filename = position+str(j+1)+'.txt'
        cur_table=makeTable(filename,position)
        table=rowBind(table,cur_table)
    #Need to make it into a numpy array to be able to manipulate it easily
    table=np.array(table)
    table=np.transpose(table)
    np.savetxt(position+'.csv', table, fmt='%s %s %s %s %s',delimiter=',') #save as csv (not really needed)
    tables[POSITIONS[i]]=table #add to tables

Now that we have all the data in a nice table, we can create teams that cost less than 100 and see which one has the highest total score. However there are a few players we can get rid of before doing that.

The idea is that if we only need 1 goalkeeper, we should only keep the 1 goalkeeper with the highest total score for each price (you would always rather choose a goalkeeper with a higher score for the same price, and you can choose a maximum of 1). Similarly for defenders we'll keep the top 4 per price, and so on for the other positions. We're also not considering players who aren't expected to play as usual (so we only keep players whose status reads 'View player information,').
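The pruning idea can be sketched on its own with plain tuples. This is not the notebook's code: `top_k_per_price` and the toy players are made up for illustration, and it does the filtering in a single pass rather than looping over each price.

```python
from collections import defaultdict

def top_k_per_price(players, k):
    """Keep at most k players for each distinct price.

    `players` is a list of (name, price, score) tuples already sorted by
    score, highest first, so the first k seen at a price are the best k.
    """
    kept_per_price = defaultdict(int)
    kept = []
    for name, price, score in players:
        if kept_per_price[price] < k:
            kept.append((name, price, score))
            kept_per_price[price] += 1
    return kept

# Made-up goalkeepers, sorted by score as on the website.
toy_keepers = [("A", 5.5, 120), ("B", 5.5, 115), ("C", 5.0, 110), ("D", 5.0, 90)]
```

With `k=1`, keepers B and D are dropped because a better keeper is available at the same price.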

In [49]:
### Picking top players per price ###

#Initialize variables
short_tables={} # Shortened table of players for each position
#For each position, pick the top n per price, where n=TOT_NUMBER[i] (note that the tables are already ordered by
#total points, so we can just pick the first n per price)
for i in range(0,len(POSITIONS)):
    table=tables[POSITIONS[i]]
    short_table=[]
    prices=set(table[:,2]) #list of unique prices
    for price in prices:
        counter=0 #when we've reached the max number of players, we have to leave the loop
        for j in range(0,len(table)):
            isPlaying = (table[j,0]=='View player information,')
            if (table[j,2]==price and isPlaying):
                table[j,2]=removeChars(table[j,2]) #need to remove pound signs and commas from price
                table[j,3]=removeChars(table[j,3]) #need to remove commas from total points
                short_table.append(table[j,:])
                counter+=1
            if counter >= TOT_NUMBER[i]:
                break
    short_tables[POSITIONS[i]]=np.asmatrix(short_table)

Now that we've kept only the useful players, we'll go through all the possible teams that cost at most 100 and keep the team with the most points. We note that we also need to choose 1 extra player per position (i.e. the substitutes), but we don't plan to use them, so to take them into account we find the minimum price per position and set that amount aside from the budget. There are 2 ways of going about this. The first is the method below, which has many nested for loops and isn't very elegant but gets the job done. It does take a while to run! (on the order of a day maybe?)
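As a rough illustration of the two numbers that matter here, the sketch below works out the budget left for the starting 11 and counts how many teams an exhaustive search would visit. The function names, prices and pool sizes are all hypothetical; plug in the real values to get an idea of the running time.

```python
import math

def budget_for_starting_eleven(total_budget, min_costs):
    """Set aside the cheapest player in each position (the four subs)."""
    return total_budget - sum(min_costs.values())

def search_space_size(pool_sizes, picks):
    """Number of teams visited when every combination is tried."""
    return math.prod(math.comb(n, k) for n, k in zip(pool_sizes, picks))

# Hypothetical cheapest prices per position, for illustration only.
toy_min_costs = {"GK": 4.0, "D": 4.0, "MF": 4.5, "F": 4.5}
remaining = budget_for_starting_eleven(100, toy_min_costs)  # 100 - 17.0 = 83.0

# Hypothetical pool sizes after pruning, with 1, 4, 4 and 2 starters.
n_teams = search_space_size([10, 30, 40, 15], [1, 4, 4, 2])
```

The count grows very quickly with the pool sizes, which is why trimming the tables first matters so much.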

In [None]:
# Function: findMinCosts
# Usage: minCosts = findMinCosts(tables)
# Descriptions: Given the table of players returns the minimum cost
# per position
#######################################################################
def findMinCosts(tables):
    minCosts={}
    for position in POSITIONS:
        table=tables[position]
        costs=[]
        for row in table:
            costs.append(float(removeChars(row[2])))
        minCosts[position]=min(costs)
    return minCosts

## Max costs of players without subs
minCosts=findMinCosts(tables)
totalMinCost=0
for position in minCosts:
    totalMinCost+=float(minCosts[position])
maxCost=MAX_COST-totalMinCost

best_team=[]
cost=0
max_score=0
for GK in itertools.combinations(range(0,len(short_tables['GK'])), TOT_NUMBER[0]):
    table=short_tables['GK']
    costGK = cost + sum(table[GK,2].astype(float)) #np.float no longer exists in recent NumPy, so use float
    teamGK=table[GK,:] #Add players to team
    for D in itertools.combinations(range(0,len(short_tables['D'])), TOT_NUMBER[1]):
        table=short_tables['D']
        costD = costGK + sum(table[D,2].astype(float))
        teamD=np.append(teamGK,table[D,:],axis=0) #Add players to team
        for MF in itertools.combinations(range(0,len(short_tables['MF'])), TOT_NUMBER[2]):
            table=short_tables['MF']
            costMF = costD + sum(table[MF,2].astype(float))
            teamMF=np.append(teamD,table[MF,:],axis=0) #Add players to team
            for F in itertools.combinations(range(0,len(short_tables['F'])), TOT_NUMBER[3]):
                table=short_tables['F']
                costF = costMF + sum(table[F,2].astype(float))
                teamF=np.append(teamMF,table[F,:],axis=0) #Add players to team
                score=np.sum(teamF[:,3].astype(float))
                if (score > max_score and costF <=maxCost):
                    max_score=score
                    best_team=teamF

                

In [None]:
print(best_team)

The other way is a recursive method. However, it did not run on my computer due to a memory error (it stores every affordable team in a list), and I'm not sure how powerful a computer you need for this to run properly.
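One possible way around the memory problem is to remember only the best team found so far instead of collecting every team. The sketch below is not the notebook's code: `find_best_team` is a made-up helper that works on plain lists of (name, price, score) tuples rather than on the numpy tables. Memory use then depends on the depth of the recursion rather than on the number of teams, although the running time is still exponential.

```python
import itertools

def find_best_team(pools, picks, budget):
    """Depth-first search over positions that keeps only the best team.

    pools  : list of player lists, one per position; each player is a
             (name, price, score) tuple
    picks  : how many players to take from each pool
    budget : maximum total price of the team
    """
    best = {"score": float("-inf"), "team": None}

    def search(pos, team, cost, score):
        if pos == len(pools):  # every position has been filled
            if score > best["score"]:
                best["score"], best["team"] = score, team
            return
        for subset in itertools.combinations(pools[pos], picks[pos]):
            new_cost = cost + sum(p[1] for p in subset)
            if new_cost > budget:
                continue  # prune: this branch already costs too much
            search(pos + 1, team + list(subset), new_cost,
                   score + sum(p[2] for p in subset))

    search(0, [], 0.0, 0)
    return best["team"], best["score"]
```

If no team fits the budget, the function returns `(None, -inf)`.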

In [None]:
### Find all teams that cost less than 100 ###

# Function: findMinCosts
# Usage: minCosts = findMinCosts(tables)
# Descriptions: Given the table of players returns the minimum cost
# per position
#######################################################################
def findMinCosts(tables):
    minCosts={}
    for position in POSITIONS:
        table=tables[position]
        costs=[]
        for row in table:
            costs.append(float(removeChars(row[2])))
        minCosts[position]=min(costs)
    return minCosts

# Function: makeTeams
# Usage: teams = makeTeams(initCost,initPos,team,teams,maxCost)
# Descriptions: Given initial cost (usually 0) and initial position
# number (usually also 0 i.e. 'GK'), initial (usually empty) team
# (np.empty((0,5), int)), initial teams list (usually [] i.e. empty)
# and the maximum amount of money you can spend, this will return
# all the possible teams that cost at most maxCost from the lists of
# players in 'short_tables'.
#######################################################################
def makeTeams(total_cost,positionNo,team,teams,maxCost):
    #If we've gone through all the positions, append the team to the list and return the list
    if positionNo > 3:
        teams.append(team)
        return teams
    else:
        #Keep a record of the current teams costs and position numbers for iterations
        orig_team=team
        orig_cost=total_cost
        orig_positionNo=positionNo
        table=short_tables[POSITIONS[positionNo]]
        # loop through each combination of n players (where n depends on the position e.g. for 'GK' n=1) 
        for subset in itertools.combinations(range(0,len(table)), TOT_NUMBER[positionNo]):
            team=orig_team
            positionNo=orig_positionNo
            #Find cost of players and add to total cost
            cost = sum(table[subset,2].astype(float)) #np.float no longer exists in recent NumPy
            total_cost+=cost
            if total_cost <= maxCost:
                team=np.append(team,table[subset,:],axis=0) #Add players to team
                positionNo+=1 #Go to next position
                teams=makeTeams(total_cost,positionNo,team,teams,maxCost) #Find players for next position
                total_cost-=cost
            else:
                total_cost-=cost
        return teams

## Main Function ##

## Max costs of players without subs
minCosts=findMinCosts(tables)
totalMinCost=0
for position in minCosts:
    totalMinCost+=float(minCosts[position])
maxCost=MAX_COST-totalMinCost

initCost=0
initPosNo=0
teams=[]
team=np.empty((0,5), int)
teams=makeTeams(initCost,initPosNo,team,teams,maxCost)

In [None]:
### Find the team with the highest total score ###

#Initialize variables
max_score=0
best_team=[]
#Iterate through teams, find total score, and keep best team
for team in teams:
    score=np.sum(team[:,3].astype(float)) #np.float no longer exists in recent NumPy
    if score > max_score:
        best_team=team
        max_score=score
print(best_team)