In what follows we will attempt to identify players for future scouting. We will focus on forwards for
simplicity.

In [1]:
import pandas as pd
import numpy as np
import warnings
import os
import pathlib
import json

Import FBRef data from the top 5 leagues in the 20-21 season

In [2]:
data = pd.read_csv("FBRef_20-21_T5_Data.csv")
#"Min" is stored as text with thousands separators, so strip the commas and convert to int
data["Min"] = data["Min"].apply(lambda x: x.replace(",","")).astype(int)
#Restrict to players who have played more than 500 minutes
data1 = data.loc[data["Min"]> 500]
data1 = data1.reset_index(drop = True)
#clean the data set: fill missing values with 0 and drop two players by name
data1 = data1.fillna(0)
data1 = data1.loc[data1["Player"] != "Salvador Ferrer"]
data1 = data1.loc[data1["Player"] != "Jota"]

In [10]:
#extract the forwards
forwards = data1.loc[data1['Pos'].isin(['FW', 'MF,FW', 'FW,MF'])]
forwards = forwards.reset_index()
#remove names, teams, etc.
AttFeat = forwards.iloc[:, 12:]

This data set is still fairly high-dimensional and contains extraneous data such as the number of nutmegs.
We will select the features we think are relevant to identifying the play-style of forwards. In particular, we will
restrict to per-90 statistics, remove extraneous data points like the aforementioned nutmegs, and
remove stats that indicate player quality, like xG/90. This is because we only want to measure play-style, not player quality.

Player quality is obviously incredibly important when identifying players for transfers, but we don't want two players to be matched merely because they have similar production when they play very different styles. For example, if you include the xG data in what follows, the algorithm will return Danny Ings as one of the most similar players to Michail Antonio. Anyone who has watched these two players will recognize how different they are as strikers, despite them now playing for the same club 3 years after this data was collected.

In [11]:
AFdf = AttFeat.iloc[:, [11, 15, 17, 19, 20, 23, 26, 29, 30, 31, 32, 33, 38, 39, 54, 55, 56, 57, 58, 59, 69, 70, 71, 76, 77, 79, 80, 81, 91, 92, 93, 105, 107, 108, 109, 110, 111, 112, 116]]
#Let's see the list of features
AFdf.columns

Index(['AvgShotDist', 'PassCmp/90', 'PassCmp%', 'PrgDistPass/90',
       'ShortCmp/90', 'MedCmp/90', 'LongCmp/90', 'KeyPass/90',
       'PassIntoThird/90', 'PassIntoBox/90', 'CrossIntoBox/90', 'ProgPass/90',
       'PassUnderPress/90', 'Switches/90', 'PassLiveSCA/90', 'PassDeadSCA/90',
       'DribSCA/90', 'ShSCA/90', 'FoulSCA/90', 'DefSCA/90', 'Def 3rdTkl/90',
       'Mid 3rdTkl/90', 'Att 3rdTkl/90', 'PressAtt/90', 'SuccPress/90',
       'Def 3rdPress/90', 'Mid 3rdPress/90', 'Att 3rdPress/90',
       'Mid 3rdTchs/90', 'Att 3rdTchs/90', 'Att PenTchs/90', 'Carries/90',
       'PrgDistCarry/90', 'ProgCarry/90', 'CarryIntoThird/90',
       'CarryIntoBox/90', 'Miscontrol/90', 'Dispossessed/90',
       'ProgPassReceived/90'],
      dtype='object')
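
Positional indices like the ones above will silently pick different columns if the column order of the CSV ever changes. As a minimal sketch, assuming a toy frame with made-up values, the same kind of selection can be written by column name. The helper `select_style_features` and the column name `Nutmegs/90` are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def select_style_features(df, exclude, extra=()):
    """Keep per-90 columns plus explicitly named extras, minus excluded columns."""
    excluded = set(exclude)
    per90 = [c for c in df.columns if c.endswith("/90")]
    keep = [c for c in list(extra) + per90 if c not in excluded]
    return df[keep]

# Toy data with made-up values; 'Nutmegs/90' is a hypothetical column name.
toy = pd.DataFrame({
    "Player": ["A", "B"],
    "AvgShotDist": [15.2, 18.9],
    "KeyPass/90": [1.1, 2.3],
    "Nutmegs/90": [0.2, 0.1],
    "xG/90": [0.45, 0.20],
})

# Drop an extraneous stat and a quality stat, keep one non-per-90 style stat.
style = select_style_features(
    toy, exclude=["Nutmegs/90", "xG/90"], extra=["AvgShotDist"]
)
print(list(style.columns))
```

Selecting by name also makes the excluded quality stats explicit, which is easier to review than a list of integers.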

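Let's now try the most naive way of identifying new players. We will pick a player whose play-style we like. For our purposes it will be Michail Antonio of West Ham United. We then simply look for the players whose stats are closest to his. Since a player is always his own nearest neighbor, asking for 5 neighbors returns Antonio himself plus the 4 closest other players.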

Note that we haven't scaled the data at all yet.
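
To make "closest" concrete: with its default settings, `NearestNeighbors` ranks rows by Euclidean distance. The following is a minimal brute-force sketch of that idea on made-up data. The function `nearest_rows` is a hypothetical helper for illustration, not scikit-learn's implementation.

```python
import numpy as np

def nearest_rows(X, query_idx, k):
    """Indices and Euclidean distances of the k rows closest to row query_idx,
    excluding the query row itself."""
    diffs = X - X[query_idx]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    order = np.argsort(dists, kind="stable")
    order = order[order != query_idx][:k]
    return order, dists[order]

# Four made-up "players" described by two features each.
X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [10.0, 10.0]])
idx, d = nearest_rows(X, query_idx=0, k=2)
print(idx, d)
```

Unlike `kneighbors`, this sketch drops the query row from the result, so `k` counts only other players.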

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

In [12]:
neigh = NearestNeighbors(n_neighbors = 5)
neigh.fit(AFdf)

In [13]:
#We first need to find the index of Michail Antonio
forwards.index[forwards['Player'] == 'Michail Antonio']

Index([8], dtype='int64')

In [14]:
#Now let's get Antonio's stats in a proper format
Antonio = AFdf.loc[8].to_numpy().reshape(1,-1)

In [15]:
#Find the 5 nearest neighbors to Antonio
neigh.kneighbors(Antonio)



(array([[ 0.        , 11.93583261, 13.46652888, 13.74782528, 14.12232984]]),
 array([[  8,  14, 116, 174, 124]], dtype=int64))

In [104]:
#Now we have the indices of the players. Who are they?
forwards.loc[[8, 14, 116, 174, 124]]

Unnamed: 0,index,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,PrgDistCarry/90,ProgCarry/90,CarryIntoThird/90,CarryIntoBox/90,Miscontrol/90,Dispossessed/90,PassTarget/90,PassesReceived/90,PassRec%,ProgPassReceived/90
8,772,Michail Antonio,eng ENG,FW,West Ham,eng Premier League,30.0,1990.0,26,24,...,84.4,4.89,1.51,1.23,3.52,2.6,55.7,29.3,52.6,8.77
14,778,Keita Baldé,sn SEN,FW,Sampdoria,it Serie A,25.0,1995.0,25,12,...,82.4,4.59,1.2,1.65,3.16,1.65,55.6,30.3,54.5,10.2
116,881,Arnaud Kalimuendo,fr FRA,FW,Lens,fr Ligue 1,18.0,2002.0,28,13,...,89.4,5.19,1.41,1.33,2.22,3.26,50.1,28.4,56.8,9.48
174,939,Karim Onisiwo,at AUT,FW,Mainz 05,de Bundesliga,28.0,1992.0,31,19,...,82.0,4.1,1.37,1.07,3.85,2.88,51.9,25.9,49.9,8.34
124,889,Randal Kolo Muani,fr FRA,FW,Nantes,fr Ligue 1,21.0,1998.0,37,35,...,78.4,4.07,0.8,1.36,3.06,2.76,45.5,23.2,51.0,7.57


In [17]:
#Let's repeat with Messi
forwards.index[forwards['Player'] == 'Lionel Messi']

Index([360], dtype='int64')

In [18]:
Messi = AFdf.loc[360].to_numpy().reshape(1,-1)
neigh.kneighbors(Messi)



(array([[ 0.        , 26.91673086, 43.59618447, 62.23733365, 63.8992856 ]]),
 array([[360, 556, 524, 302, 316]], dtype=int64))

In [20]:
forwards.loc[[556, 524, 302, 316]]

Unnamed: 0,index,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,PrgDistCarry/90,ProgCarry/90,CarryIntoThird/90,CarryIntoBox/90,Miscontrol/90,Dispossessed/90,PassTarget/90,PassesReceived/90,PassRec%,ProgPassReceived/90
556,1905,Neymar,br BRA,"MF,FW",Paris S-G,fr Ligue 1,28.0,1992.0,18,15,...,245.3,13.8,4.59,1.46,4.01,4.33,82.2,70.5,85.8,11.7
524,1873,Isco,es ESP,"MF,FW",Real Madrid,es La Liga,28.0,1992.0,25,8,...,219.9,11.0,3.37,0.69,2.18,1.49,78.5,70.5,89.8,5.35
302,1078,Papu Gómez,ar ARG,"FW,MF",Atalanta,it Serie A,32.0,1988.0,10,9,...,218.6,11.6,4.86,0.41,1.08,1.35,63.6,54.9,86.2,5.41
316,1092,Josip Iličić,si SVN,"FW,MF",Atalanta,it Serie A,32.0,1988.0,28,17,...,198.5,12.1,4.0,2.12,2.88,3.76,79.1,64.2,81.1,12.1


We now recall that we didn't scale our data at the beginning. Let's now scale the data and redo the analysis.

In [186]:
scaler = StandardScaler()
AFdf_scaled = scaler.fit_transform(AFdf)

neigh_scaled = NearestNeighbors(n_neighbors = 5)
neigh_scaled.fit(AFdf_scaled)

In [99]:
neigh_scaled.kneighbors(AFdf_scaled[8].reshape(1,-1), return_distance = False)

array([[  8,  64, 241,  43, 166]], dtype=int64)

In [101]:
forwards.loc[[8, 64, 241, 43, 166]]

Unnamed: 0,index,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,PrgDistCarry/90,ProgCarry/90,CarryIntoThird/90,CarryIntoBox/90,Miscontrol/90,Dispossessed/90,PassTarget/90,PassesReceived/90,PassRec%,ProgPassReceived/90
8,772,Michail Antonio,eng ENG,FW,West Ham,eng Premier League,30.0,1990.0,26,24,...,84.4,4.89,1.51,1.23,3.52,2.6,55.7,29.3,52.6,8.77
64,828,Edin Džeko,ba BIH,FW,Roma,it Serie A,34.0,1986.0,27,20,...,63.4,3.84,1.23,0.99,3.0,2.07,49.8,29.9,60.0,9.9
241,1006,Callum Wilson,eng ENG,FW,Newcastle Utd,eng Premier League,28.0,1992.0,26,23,...,59.2,3.06,1.16,1.03,2.93,2.28,47.3,19.8,41.9,6.55
43,807,Jhon Córdoba,co COL,FW,Hertha BSC,de Bundesliga,27.0,1993.0,21,17,...,44.3,2.22,0.94,0.88,2.87,1.99,47.4,23.1,48.8,7.49
166,931,Ibrahima Niane,sn SEN,FW,Metz,fr Ligue 1,21.0,1999.0,10,9,...,36.4,2.3,0.54,0.41,4.59,2.03,47.7,27.0,56.7,10.1


In [102]:
neigh_scaled.kneighbors(AFdf_scaled[360].reshape(1,-1), return_distance = False)

array([[360, 556, 302, 316, 487]], dtype=int64)

In [105]:
forwards.loc[[360, 556, 302, 316, 487]]

Unnamed: 0,index,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,PrgDistCarry/90,ProgCarry/90,CarryIntoThird/90,CarryIntoBox/90,Miscontrol/90,Dispossessed/90,PassTarget/90,PassesReceived/90,PassRec%,ProgPassReceived/90
360,1136,Lionel Messi,ar ARG,"FW,MF",Barcelona,es La Liga,33.0,1987.0,35,33,...,246.1,15.4,5.71,1.46,1.19,2.83,87.2,74.7,85.7,8.27
556,1905,Neymar,br BRA,"MF,FW",Paris S-G,fr Ligue 1,28.0,1992.0,18,15,...,245.3,13.8,4.59,1.46,4.01,4.33,82.2,70.5,85.8,11.7
302,1078,Papu Gómez,ar ARG,"FW,MF",Atalanta,it Serie A,32.0,1988.0,10,9,...,218.6,11.6,4.86,0.41,1.08,1.35,63.6,54.9,86.2,5.41
316,1092,Josip Iličić,si SVN,"FW,MF",Atalanta,it Serie A,32.0,1988.0,28,17,...,198.5,12.1,4.0,2.12,2.88,3.76,79.1,64.2,81.1,12.1
487,1836,Philippe Coutinho,br BRA,"MF,FW",Barcelona,es La Liga,28.0,1992.0,12,8,...,209.5,12.7,3.42,2.19,0.68,1.64,81.6,70.4,86.2,8.36


After scaling, the players closest to Messi barely change: Neymar, Papu Gómez, and Josip Iličić remain, and only Isco is replaced by Philippe Coutinho. The players closest to Antonio, however, change completely.

If we look at our features, we see that a few of them are distances, namely progressive carry distance per 90 and progressive pass distance per 90. These take values in the tens or hundreds, while many of the other per-90 counts are in the single digits. Without scaling, the Euclidean distance used by the nearest-neighbor search is dominated by these two features, so the algorithm is biased towards players who are similar on these two numbers. After scaling, every feature contributes on a comparable footing.
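
The effect of scaling can be reproduced on a tiny example. This is a sketch with made-up numbers: column 0 plays the role of a distance stat (values in the hundreds) and column 1 the role of a count stat (single digits). The standardization is done by hand here and matches what `StandardScaler` does with its default settings.

```python
import numpy as np

# Made-up values: [distance-like stat, count-like stat]
X = np.array([
    [200.0, 1.0],   # query player
    [205.0, 9.0],   # similar distance stat, very different count stat
    [260.0, 1.2],   # different distance stat, similar count stat
    [100.0, 5.0],
])

def closest_to_first(M):
    """Row index (1-based within M) of the row closest to row 0."""
    d = np.linalg.norm(M[1:] - M[0], axis=1)
    return int(np.argmin(d)) + 1

# Standardize each column to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

raw_nn = closest_to_first(X)      # decided almost entirely by column 0
scaled_nn = closest_to_first(Z)   # both columns now matter
print(raw_nn, scaled_nn)
```

On the raw data the neighbor is the row that matches on the large-valued column. After standardization the neighbor is the row that is reasonably close on both columns.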

Now let's use agglomerative clustering to cluster our forwards into groups. As a first attempt, we hypothesize that there are 5 large groups of forwards: wingers, false nines, target men, poachers, and second strikers.

In [211]:
from sklearn.cluster import AgglomerativeClustering

scan = AgglomerativeClustering(n_clusters=5)

In [212]:
labels2 = scan.fit_predict(AFdf_scaled)

In [213]:
#copy the slice so that adding a column does not trigger a SettingWithCopyWarning
res2 = forwards.iloc[:, 1:6].copy()
res2['label'] = labels2
#check which labels a few well-known players received
res2[res2['Player'].isin(['Jack Grealish', 'Lionel Messi', 'Roberto Firmino',
                          'Luis Suárez', 'Zlatan Ibrahimović'])]

Unnamed: 0,Player,Nation,Pos,Squad,Comp,label
70,Roberto Firmino,br BRA,FW,Liverpool,eng Premier League,3
96,Zlatan Ibrahimović,se SWE,FW,Milan,it Serie A,3
220,Luis Suárez,uy URU,FW,Atlético Madrid,es La Liga,1
306,Jack Grealish,eng ENG,"FW,MF",Aston Villa,eng Premier League,2
360,Lionel Messi,ar ARG,"FW,MF",Barcelona,es La Liga,2


Let's investigate each group to see what types of players are in each.

In [214]:
#look at the players with label 0
res2[res2['label'] == 0]

Unnamed: 0,Player,Nation,Pos,Squad,Comp,label
40,Samu Castillejo,es ESP,FW,Milan,it Serie A,0
123,Justin Kluivert,nl NED,FW,RB Leipzig,de Bundesliga,0
200,Alexis Saelemaekers,be BEL,FW,Milan,it Serie A,0
253,Aymen Barkok,ma MAR,"FW,MF",Eint Frankfurt,de Bundesliga,0
265,Julian Brandt,de GER,"FW,MF",Dortmund,de Bundesliga,0
287,Eric Junior Dina Ebimbe,fr FRA,"FW,MF",Dijon,fr Ligue 1,0
293,Fernando Forestieri,it ITA,"FW,MF",Udinese,it Serie A,0
338,Érik Lamela,ar ARG,"FW,MF",Tottenham,eng Premier League,0
388,Yeremi Pino,es ESP,"FW,MF",Villarreal,es La Liga,0
419,Romano Schmid,at AUT,"FW,MF",Werder Bremen,de Bundesliga,0


These look like a mix of wide forwards and attacking midfielders who look to score goals. Repeating this for each of the other labels, the remaining groups look like finishers, creative forwards, target men, and wingers. This classification is still not perfect, but it seems better than the attempt with the dimension-reduced data.
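
Reading player names is one way to interpret a cluster. Another is to look at which standardized features are highest on average within each cluster. The following is a minimal sketch on made-up z-scores and labels, not the notebook's data. The variable names `scaled`, `labels`, and `profile` are chosen for illustration only.

```python
import pandas as pd

# Toy standardized features (made-up z-scores) for four players.
scaled = pd.DataFrame({
    "KeyPass/90": [1.2, 0.9, -0.8, -1.1],
    "Att PenTchs/90": [-0.7, -0.9, 1.0, 1.3],
    "PressAtt/90": [0.1, -0.2, 0.3, -0.1],
})
# Toy cluster labels, one per player.
labels = pd.Series([0, 0, 1, 1], name="label")

# Mean z-score of each feature within each cluster.
profile = scaled.groupby(labels).mean()

# For each cluster, the feature on which it is furthest above average.
top_feature = profile.idxmax(axis=1)
print(top_feature.to_dict())
```

In this toy example one cluster stands out on key passes and the other on touches in the penalty area, which would suggest a creator group and a finisher group.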

In [215]:
wide_forwards = res2[res2['label'] == 0]
finishers = res2[res2['label'] == 1]
creators = res2[res2['label'] == 2]
targets = res2[res2['label'] == 3]
wingers = res2[res2['label'] == 4]

Let's now try to use these two methods to scout for a player. Let's imagine we are Aston Villa trying to replace Jack Grealish after his transfer to Man City. As a first attempt let's start with a list of players who are 'close' to Grealish.

In [187]:
neigh_scaled2 = NearestNeighbors(n_neighbors = 51)
neigh_scaled2.fit(AFdf_scaled)

In [190]:
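def fifty_nearest_players(player):
    #row of the player; forwards was re-indexed with reset_index, so labels and positions agree
    index = forwards.index[forwards['Player'] == player][0]
    #the fitted model returns 51 neighbors because the closest one is the player himself
    sims = neigh_scaled2.kneighbors(AFdf_scaled[index].reshape(1,-1), return_distance = False)
    return forwards.loc[sims[0]]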

In [206]:
#get dataframe of 50 closest players
JG_close_df = fifty_nearest_players('Jack Grealish')
#remove Grealish
df = JG_close_df.iloc[1:, :]
#restrict to players younger than 25
df = df[df['Age'] < 25]
#npxG+xA at least 0.5/90
df = df[df['npxG+xA/90'] > 0.5]
df

Unnamed: 0,index,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,PrgDistCarry/90,ProgCarry/90,CarryIntoThird/90,CarryIntoBox/90,Miscontrol/90,Dispossessed/90,PassTarget/90,PassesReceived/90,PassRec%,ProgPassReceived/90
413,1189,Jadon Sancho,eng ENG,"FW,MF",Dortmund,de Bundesliga,20.0,2000.0,26,24,...,188.6,10.4,3.54,2.23,2.53,1.79,73.9,61.3,82.9,9.52
275,1051,Kingsley Coman,fr FRA,"FW,MF",Bayern Munich,de Bundesliga,24.0,1996.0,29,23,...,161.5,10.3,1.85,2.41,2.36,2.56,61.0,47.1,77.1,11.5
363,1139,Aleksei Miranchuk,ru RUS,"FW,MF",Atalanta,it Serie A,24.0,1995.0,25,4,...,126.4,8.67,3.07,1.33,1.73,2.27,64.0,50.9,79.6,11.2
292,1068,Phil Foden,eng ENG,"FW,MF",Manchester City,eng Premier League,20.0,2000.0,28,17,...,124.3,7.61,1.83,2.06,2.17,1.61,57.3,46.5,81.2,8.83
449,1798,Houssem Aouar,fr FRA,"MF,FW",Lyon,fr Ligue 1,22.0,1998.0,30,23,...,173.2,8.74,2.47,1.36,1.92,2.73,64.0,53.4,83.4,7.73
197,962,Rodrygo,br BRA,FW,Real Madrid,es La Liga,19.0,2001.0,22,10,...,163.7,8.72,3.12,1.28,2.39,1.83,54.0,42.5,78.6,7.06
252,1028,Leon Bailey,jm JAM,"FW,MF",Leverkusen,de Bundesliga,22.0,1997.0,30,25,...,147.1,7.31,2.69,1.51,2.27,2.02,53.0,39.0,73.5,8.57
375,1151,Christopher Nkunku,fr FRA,"FW,MF",RB Leipzig,de Bundesliga,22.0,1997.0,28,19,...,113.0,6.24,2.33,1.19,2.29,2.29,60.2,44.1,73.3,9.57


Now we have a list of 8 players who are close to Grealish in style, clear the production threshold, and are under the age of 25. If we were actually passing these names to scouts, we would probably also remove Foden, Rodrygo, and Coman from the list.
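
The filtering steps above can be collected into one reusable function. This is a sketch on a toy frame with made-up players and values. The function `shortlist` and its parameter names are hypothetical. The column names `Age`, `Player`, and `npxG+xA/90` are the ones used in the cell above.

```python
import pandas as pd

def shortlist(candidates, max_age, min_output, exclude_players=()):
    """Filter candidates by age, npxG+xA per 90, and a manual exclusion list."""
    mask = (
        (candidates["Age"] < max_age)
        & (candidates["npxG+xA/90"] > min_output)
        & ~candidates["Player"].isin(list(exclude_players))
    )
    return candidates.loc[mask]

# Toy candidates with made-up values.
toy = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4"],
    "Age": [22.0, 27.0, 24.0, 19.0],
    "npxG+xA/90": [0.61, 0.80, 0.42, 0.55],
})

result = shortlist(toy, max_age=25, min_output=0.5, exclude_players=["P4"])
print(result["Player"].tolist())
```

The `exclude_players` argument covers the manual step of removing names before the list goes to scouts.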

Let's compare this to the list we would get from the clustering method.

In [216]:
#Grealish had label 2
#copy so that adding the label column does not modify forwards itself
forwards2 = forwards.copy()
forwards2['label'] = labels2
df2 = forwards2[forwards2['label'] == 2]
#players younger than 25
df2 = df2[df2['Age'] < 25]
#same production threshold as before: npxG+xA above 0.5 per 90
df2 = df2[df2['npxG+xA/90'] > 0.5]
df2

Unnamed: 0,index,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,ProgCarry/90,CarryIntoThird/90,CarryIntoBox/90,Miscontrol/90,Dispossessed/90,PassTarget/90,PassesReceived/90,PassRec%,ProgPassReceived/90,label
306,1082,Jack Grealish,eng ENG,"FW,MF",Aston Villa,eng Premier League,24.0,1995.0,26,24,...,12.6,3.74,3.29,1.89,1.93,56.1,42.4,75.6,7.49,2
413,1189,Jadon Sancho,eng ENG,"FW,MF",Dortmund,de Bundesliga,20.0,2000.0,26,24,...,10.4,3.54,2.23,2.53,1.79,73.9,61.3,82.9,9.52,2
449,1798,Houssem Aouar,fr FRA,"MF,FW",Lyon,fr Ligue 1,22.0,1998.0,30,23,...,8.74,2.47,1.36,1.92,2.73,64.0,53.4,83.4,7.73,2
511,1860,Aleksandr Golovin,ru RUS,"MF,FW",Monaco,fr Ligue 1,24.0,1996.0,21,12,...,6.47,2.77,0.67,1.93,1.85,51.1,39.5,77.3,7.23,2


Excluding Grealish himself, this leaves us with only 3 players to look at. Sancho and Aouar appear in both lists, while Golovin is unique to the clustering method.

Both methods yield similar results. However, the nearest-neighbor approach requires you to choose a specific player whose profile you want, and it may become quite computationally expensive to find nearest-neighbor sets for several different players, especially when the data is much higher-dimensional. The clustering approach only requires you to have a rough idea of a profile in mind.
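
The overlap between the two shortlists can also be checked programmatically. The sketch below types in the player names from the two output tables above by hand. In the notebook itself one would more likely take them from the `Player` columns of `df` and `df2`.

```python
# Names copied from the two output tables above (Grealish himself left out).
nn_shortlist = [
    "Jadon Sancho", "Kingsley Coman", "Aleksei Miranchuk", "Phil Foden",
    "Houssem Aouar", "Rodrygo", "Leon Bailey", "Christopher Nkunku",
]
cluster_shortlist = ["Jadon Sancho", "Houssem Aouar", "Aleksandr Golovin"]

cluster_set = set(cluster_shortlist)
nn_set = set(nn_shortlist)

# Players found by both methods, in nearest-neighbor order.
in_both = [p for p in nn_shortlist if p in cluster_set]
# Players found by only one of the two methods.
only_cluster = sorted(cluster_set - nn_set)
only_nn = sorted(nn_set - cluster_set)

print(in_both)
print(only_cluster)
print(only_nn)
```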