# Machine Learning
## Group 4
## Assignment 4
## K-means clustering


## 2. Web scraping

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
import unicodedata

from sklearn.cluster import KMeans
import numpy as np

In [4]:
attributes = ['Crossing', 'Finishing', 'Heading Accuracy', 'Short Passing', 'Volleys', 'Dribbling', 'Curve', 'Free Kick Accuracy',
              'Long Passing','Ball Control', 'Acceleration', 'Sprint Speed', 'Agility', 'Reactions', 'Balance', 'Shot Power','Jumping',
              'Stamina', 'Strength','Long Shots','Aggression','Interceptions','Positioning','Vision','Penalties','Composure','Marking',
              'Standing Tackle','Sliding Tackle','GK Diving','GK Handling', 'GK Kicking','GK Positioning','GK Reflexes']

              
links = []
for offset in ['0','100','200']:
    page = requests.get('http://sofifa.com/players?na=52&offset=' + offset)
    soup = BeautifulSoup(page.content,'html.parser')
    for link in soup.find_all('a'):
        links.append(link.get('href'))
           
# Anchors without an href yield None, so skip those before the substring test
links = ['http://sofifa.com' + l for l in links if l and 'player/' in l]

pattern = r"""\s*([\w\s]*)"""

for attr in attributes:
    pattern += r""".*?(\d*\s*"""+attr+r""")""" #Attribute variables
    
pat = re.compile(pattern, re.DOTALL)

rows = []
for j,link in enumerate(links):
    row = [link]
    playerpage = requests.get(link)
    playersoup = BeautifulSoup(playerpage.content, 'html.parser')
    text = playersoup.get_text()
    a = pat.match(text)
    if a is None:  # page text did not match the pattern; skip this player
        continue
    row.append(a.group(1))
    for i in range(2,len(attributes)+2):
        row.append(int(a.group(i).split()[0]))
    rows.append(row)
df = pd.DataFrame(rows,columns =['link','name'] + attributes)
df.to_csv('ArgentinaPlayers.csv',index=False)

The code above scrapes the FIFA attributes for the top 300 Argentinian players on the sofifa.com website.

The code starts by defining each of the 34 attributes that characterize each player in the FIFA game. These attributes range from Passing and Dribbling to Tackling and Strength. Each player in the game is represented by a combination of the above attributes.

Next, we collect the links to the 300 highest-rated players of a given country. The website lists players in pages of 100. For each page, we request the HTML content and find every anchor ("a") tag, since the links to individual players are anchor tags. For each "a" tag, we append the value of its href attribute to the links list. The href of a player link contains the player's unique ID; for example, Lionel Messi's player ID is 158023. We then keep only the href values that contain the string "player/". These correspond to each player's personal page, which lists their attributes.
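The filtering step can be sketched in isolation as follows. This is a minimal illustration, not the scraping code itself: the helper name player_links is hypothetical, and the href values are made up for the example (only the ID 158023 comes from the text above). It also shows why a None check helps, since an anchor tag without an href attribute yields None.

```python
def player_links(hrefs, base="http://sofifa.com"):
    """Keep only hrefs that point at a player page and make them absolute."""
    return [base + h for h in hrefs if h and "player/" in h]

# Illustrative href values as they might come out of link.get('href');
# None stands for an anchor tag that has no href attribute.
hrefs = [
    "/players?na=52&offset=100",  # listing page, not a player page
    None,
    "/player/158023",
    "/teams",
    "/player/000000",             # made-up ID
]

links = player_links(hrefs)
print(links)
```

Note that the listing URL contains "players?" but not "player/", so the trailing slash in the filter string is what keeps it out.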

For each of these href values, we build the full URL of the player's page by prepending http://sofifa.com. From that page we can scrape the player's game attributes. The page text follows a regular structure: each attribute appears as a number followed by the attribute name, so a single regular expression (regex) can capture all of them. The first part of the regex captures the player's name, and each subsequent part captures one attribute value together with its name, in the order of the attributes list.

We iterate through the player links created earlier. For each link, we parse the returned HTML with BeautifulSoup and call get_text() to keep only the text of the page. We then match the regex defined earlier against that text. The first capture group holds the player's name and the remaining groups hold the attributes, one per group. We convert each attribute to an integer and append it to the player's row.
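To make the pattern construction concrete, here is a small sketch that builds the same kind of regex for a shortened attribute list and applies it to a made-up page text. The player name, ID and values in the text are invented for illustration, and the real page text is of course much longer.

```python
import re

# Shortened attribute list; the notebook uses all 34 attributes.
attributes = ["Crossing", "Finishing", "Heading Accuracy"]

# Group 1: the name (word characters and whitespace at the start of the text).
pattern = r"\s*([\w\s]*)"
# One further group per attribute: optional digits, whitespace, attribute name.
for attr in attributes:
    pattern += r".*?(\d*\s*" + attr + r")"
pat = re.compile(pattern, re.DOTALL)

# Made-up text standing in for the output of get_text().
text = ("\n  Example Player (ID: 000000)\nOther page text\n"
        "77 Crossing\n95 Finishing\n71 Heading Accuracy\n")

m = pat.match(text)
name = m.group(1).strip()
values = [int(m.group(i).split()[0]) for i in range(2, len(attributes) + 2)]
print(name, values)
```

The name group stops at the first character that is neither a word character nor whitespace (here the opening parenthesis), which is also why the captured name can carry trailing whitespace before it is stripped.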

Because many of the players have non-English names, their names may contain non-ASCII characters. These characters belong to the larger Unicode character set and need to be handled properly when extracting the player's name. Fortunately, BeautifulSoup decodes the page to Unicode when parsing it with html.parser, and get_text() returns Unicode text. The extracted player names are therefore already proper Unicode strings and need no further changes.
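No conversion is needed for the analysis here. If an ASCII-only version of the names were ever wanted (for example as a file name or lookup key), one possible approach is to decompose each character and drop the combining accents with the standard unicodedata module. The helper ascii_fold below is a hypothetical addition and not part of the scraping code.

```python
import unicodedata

def ascii_fold(name):
    """Return name with accents removed, e.g. 'Agüero' -> 'Aguero'."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["Sergio Agüero", "Ángel Di María", "Éver Banega"]
folded = [ascii_fold(n) for n in names]
for original, plain in zip(names, folded):
    print(original, "->", plain)
```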

At this point, each player's attributes are stored as a list inside a list that holds all of the player data. We convert this list into a pandas dataframe with one row per player. The columns are the URL of the player's page, the player's name, and then all of the attributes defined at the start. We will use this dataframe to cluster the players, since the fit method of KMeans accepts a pandas dataframe as input. Finally, we write the data to a CSV file.

## 3. 500 English players

To download the first 500 English players instead of the first 300 Argentinian players, we need to modify the code in two places. 

First we need to modify the code to parse players from a different country. As we noticed above, Argentina's country code on the sofifa.com website is na=52. We change this value to na=14, which corresponds to England's players. 

We also need to change the number of players parsed from 300 to 500. This is done in the first for loop of the code: we change the list of offsets from ['0','100','200'] to ['0','100','200','300','400']. This adds two more pages of 100 players for the code to search through. With these two modifications, the code returns the data for the first 500 English players.

The new code is shown below.


In [5]:
## NEW CODE IS THIS FOR LOOP!
# for offset in ['0','100','200','300','400']:
#    page = requests.get('http://sofifa.com/players?na=14&offset=' + offset)
#    soup = BeautifulSoup(page.content,'html.parser')
#    for link in soup.find_all('a'):
#        links.append(link.get('href'))
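Both modifications (country code and number of players) could also be expressed as parameters rather than edited by hand. The sketch below only builds the listing URLs and makes no requests; the helper name listing_urls and its page_size parameter are hypothetical, and it assumes the site keeps serving 100 players per page as described above.

```python
def listing_urls(country_code, n_players, page_size=100):
    """Build the listing-page URLs covering the first n_players of a country."""
    offsets = range(0, n_players, page_size)
    return ["http://sofifa.com/players?na={}&offset={}".format(country_code, o)
            for o in offsets]

england = listing_urls(14, 500)    # na=14: England, five pages
argentina = listing_urls(52, 300)  # na=52: Argentina, three pages
for url in england:
    print(url)
```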


## 4. Training the model

In [6]:
#Drop link and name from variables to train on
dfX = df.drop(['link','name'],axis=1)

#Train the model
football = KMeans(n_clusters=5,random_state=12345).fit(dfX)
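For intuition about what the fit call does, here is a minimal pure-Python sketch of the standard k-means iteration: assign every point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. It is not scikit-learn's implementation. As a simplifying assumption it uses the first k points as initial centroids, and the toy points are made up.

```python
import math

def kmeans(points, k, max_iter=100):
    """Toy k-means on a list of coordinate tuples; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic initialisation (assumption)

    def assign(cs):
        # Index of the nearest centroid for every point.
        return [min(range(k), key=lambda c: math.dist(p, cs[c])) for p in points]

    for _ in range(max_iter):
        labels = assign(centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Mean of the members, coordinate by coordinate.
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return assign(centroids), centroids

# Two made-up, well-separated groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

In the notebook, each point has 34 coordinates (one per attribute) instead of two, and k is 5.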

## 5. Labelling the clusters

In [7]:
labels = football.labels_

#Extracting players into their respective clusters
# .ix has been removed from pandas; .loc accepts the boolean mask directly
c1 = df.loc[labels==0]
c2 = df.loc[labels==1]
c3 = df.loc[labels==2]
c4 = df.loc[labels==3]
c5 = df.loc[labels==4]

### Forward Cluster [1]

In [8]:
print(c1['name'])

0                 Lionel Messi 
2                Sergio Agüero 
3               Ángel Di María 
4                 Paulo Dybala 
5               Nicolás Gaitán 
10              Javier Pastore 
20                 Erik Lamela 
21             Alejandro Gómez 
23               Diego Perotti 
26          Fernando Belluschi 
28              Eduardo Salvio 
31              Manuel Lanzini 
33              Luciano Vietto 
34               Pablo Batalla 
35              Lisandro López 
38             Rodrigo Palacio 
39                Ángel Correa 
40              Ignacio Piatti 
42                   José Sosa 
50                Diego Valeri 
51                Pablo Piatti 
56            Diego Buonanotte 
58          Rogelio Funes Mori 
60             Pablo De Blasis 
71                Franco Cervi 
72                Mauro Zárate 
73              Lautaro Acosta 
74            Sebastián Blanco 
79                 Ismael Sosa 
80       Alexander Szymanowski 
                 ...           
182     

In [9]:
c1.mean(numeric_only=True)-dfX.mean()

Crossing              12.911498
Finishing             15.745243
Heading Accuracy      -3.512472
Short Passing          8.004345
Volleys               14.867790
Dribbling             15.532022
Curve                 16.058727
Free Kick Accuracy    15.553596
Long Passing           7.585730
Ball Control          10.623970
Acceleration          12.313296
Sprint Speed          10.454494
Agility               13.962022
Reactions              0.845281
Balance               13.356592
Shot Power             9.046929
Jumping               -3.107341
Stamina                3.432285
Strength             -10.386779
Long Shots            14.941798
Aggression            -6.055880
Interceptions        -12.408801
Positioning           13.590936
Vision                12.304457
Penalties              9.480000
Composure              3.581910
Marking              -15.601573
Standing Tackle      -14.024345
Sliding Tackle       -14.206292
GK Diving             -5.485506
GK Handling           -5.547566
GK Kicki

Players in this cluster have particularly high Finishing, Curve and Free Kick Accuracy scores in addition to high Vision, Dribbling and Long Shots compared to the other Argentinian players in the dataset. Based on this and the player profiles within the cluster, it is likely that the cluster contains offensive players in the centre forward position.
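The reading of the table above amounts to ranking the differences from the overall mean. As a small sketch, the snippet below sorts a subset of the Cluster 1 differences copied from the output above; the helper name extremes is hypothetical.

```python
# Subset of the Cluster 1 differences (cluster mean minus overall mean),
# copied from the output above.
diff_c1 = {
    "Crossing": 12.911498,
    "Finishing": 15.745243,
    "Dribbling": 15.532022,
    "Curve": 16.058727,
    "Free Kick Accuracy": 15.553596,
    "Long Shots": 14.941798,
    "Vision": 12.304457,
    "Marking": -15.601573,
    "Standing Tackle": -14.024345,
    "Sliding Tackle": -14.206292,
}

def extremes(diff, n=3):
    """Return the n most positive and the n most negative attributes."""
    ranked = sorted(diff.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n], ranked[-n:]

top, bottom = extremes(diff_c1)
print("Highest above average:", [name for name, _ in top])
print("Furthest below average:", [name for name, _ in bottom])
```

The same ranking can be applied to the other four clusters to support the labels given to them.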

### Goalkeeper Cluster [2]

In [10]:
print(c2['name'])

11           Gerónimo Rulli 
32            Sergio Romero 
37         Marcelo Barovero 
43            Nahuel Guzmán 
44          Willy Caballero 
53        Sebastián Torrico 
54        Agustín Marchesín 
64          Mariano Andújar 
67            Franco Armani 
83            Agustín Orión 
106         Mariano Barbosa 
109        Fernando Monetti 
124              Germán Lux 
127         Albano Bizzarri 
148      Juan Pablo Carrizo 
163    Cristian Campestrini 
175          Guillermo Sara 
184        Luciano Pocrnjic 
193             Rodrigo Rey 
205             Marcos Díaz 
217           Javier García 
231             Jorge Broun 
237            Oscar Ustari 
246          Julián Speroni 
255            Luis Ardente 
270       Leandro Fernández 
292         Nereo Fernández 
Name: name, dtype: object


In [11]:
c2.mean(numeric_only=True)-dfX.mean()

Crossing             -40.964074
Finishing            -40.171111
Heading Accuracy     -46.638148
Short Passing        -40.445926
Volleys              -38.381481
Dribbling            -47.924074
Curve                -42.515556
Free Kick Accuracy   -37.310741
Long Passing         -37.459630
Ball Control         -47.385185
Acceleration         -26.272222
Sprint Speed         -25.690741
Agility              -21.531111
Reactions             -0.485556
Balance              -19.184815
Shot Power           -44.891481
Jumping               -7.757778
Stamina              -34.740000
Strength              -3.032222
Long Shots           -44.434815
Aggression           -33.537778
Interceptions        -31.672222
Positioning          -45.104444
Vision               -15.981852
Penalties            -32.223704
Composure            -23.852963
Marking              -33.045185
Standing Tackle      -36.981481
Sliding Tackle       -33.795556
GK Diving             56.813704
GK Handling           53.955556
GK Kicki

Players in this cluster have very high average values in the GK attributes and below-average values in the other attributes shown. This is very likely a cluster of goalkeepers.

### Center Defensive Midfield Cluster [3]

In [12]:
print(c3['name'])

6           Nicolás Otamendi 
7                Éver Banega 
8          Javier Mascherano 
12               Marcos Rojo 
13           Mateo Musacchio 
15            Ezequiel Garay 
16            Pablo Zabaleta 
17              Lucas Biglia 
18         Augusto Fernández 
19           Roberto Pereyra 
22             Claudio Yacob 
27          Cristian Ansaldi 
36            Nicolás Pareja 
47             Guido Pizarro 
48              Marcos Acuña 
49                Enzo Pérez 
57             David Abraham 
65               Pablo Pérez 
68              Lucas Castro 
70           Leandro Paredes 
75         Esteban Cambiasso 
76              Gino Peruzzi 
77             Fernando Gago 
81         Facundo Roncaglia 
82              Emmanuel Mas 
84        Matías Kranevitter 
87               Óscar Trejo 
89         Ramiro Funes Mori 
90         Gonzalo Escalante 
96           Leonardo Ponzio 
                ...          
206            Rodrigo Braña 
210            Emanuel Insúa 
216       

In [13]:
c3.mean(numeric_only=True)-dfX.mean()

Crossing               7.534908
Finishing             -2.429963
Heading Accuracy       3.575934
Short Passing          6.923223
Volleys               -1.650916
Dribbling              4.985165
Curve                  4.615092
Free Kick Accuracy     5.166264
Long Passing           8.642527
Ball Control           5.213919
Acceleration           2.660623
Sprint Speed           2.522527
Agility                1.459121
Reactions              1.784286
Balance                2.490806
Shot Power             6.384872
Jumping                3.001685
Stamina               10.355238
Strength               2.314542
Long Shots             7.083297
Aggression            12.188718
Interceptions         21.652564
Positioning            2.731941
Vision                 3.862674
Penalties              0.227253
Composure              3.033077
Marking               22.920220
Standing Tackle       23.320513
Sliding Tackle        23.419341
GK Diving             -5.626264
GK Handling           -5.145788
GK Kicki

Players in this cluster have very high Tackling, Marking and Interceptions scores, while their other scores are close to average. Their positions vary, from midfielders to fullbacks and a few central defenders. We assume that this cluster mostly consists of midfielders, potentially central defensive midfielders given the high defensive ratings. The defensive averages might be inflated, however, by the inclusion of fullbacks, who share many attributes that are also relevant in the central defensive midfield position. Since several central attacking midfielders are included in the cluster as well, we conclude that the cluster mainly consists of central midfielders of all types, despite the inclusion of a few fullbacks and centre defenders.

### Centre Defense Cluster [4]

In [14]:
print(c4['name'])

14       Gonzalo Rodríguez 
24          Federico Fazio 
25          Gustavo Cabral 
29      Federico Fernández 
30          Lisandro López 
41           Víctor Cuesta 
46       Martín Demichelis 
52     Santiago Gentiletti 
63        Nicolás Burdisso 
66      Nicolás Tagliafico 
69         Germán Pezzella 
88         Jonatan Maidana 
93        Santiago Vergini 
107         Matías Caruzzo 
110          Luciano Lollo 
111        Emiliano Rigoni 
112       Martín Mantovani 
121       Jonathan Schunke 
126         Nicolás Spolli 
136        Marcos Angeleri 
138         Renato Civelli 
144       Juan Insaurralde 
147      Carlos Izquierdoz 
151        Matías Zaldivia 
158     José María Basanta 
165       Leandro Desábato 
168       Maximiliano Meza 
174            Juan Forlin 
177         Fernando Tobio 
181     Guillermo Burdisso 
186            Daniel Díaz 
194           Germán Conti 
202         Ezequiel Muñoz 
208        Javier Gandolfi 
211        Emanuel Mammana 
224          Ezequie

In [15]:
c4.mean(numeric_only=True)-dfX.mean()

Crossing             -13.423333
Finishing            -20.166667
Heading Accuracy      12.590000
Short Passing         -5.886667
Volleys              -16.913333
Dribbling            -16.570000
Curve                -15.266667
Free Kick Accuracy   -15.770000
Long Passing          -4.490000
Ball Control          -8.773333
Acceleration         -12.943333
Sprint Speed          -9.650000
Agility              -16.860000
Reactions             -4.670000
Balance              -16.796667
Shot Power           -12.736667
Jumping                0.033333
Stamina               -6.273333
Strength              10.603333
Long Shots           -22.780000
Aggression            11.753333
Interceptions         19.123333
Positioning          -20.013333
Vision               -19.456667
Penalties            -10.440000
Composure             -2.050000
Marking               24.480000
Standing Tackle       22.766667
Sliding Tackle        21.080000
GK Diving             -5.930000
GK Handling           -6.106667
GK Kicki

Players in this cluster have high Marking and Tackling scores. Unlike Cluster 3, they have very poor Finishing and Free Kick Accuracy scores and below-average Sprint Speed and Acceleration. This is thus likely a cluster of central defenders.

### Striker Cluster [5]

In [16]:
print(c5['name'])

1          Gonzalo Higuaín 
9             Mauro Icardi 
45            Lucas Alario 
55             Marco Ruben 
59          Nicolás Blandi 
61         Darío Benedetto 
62             Gustavo Bou 
78           Mauro Boselli 
95         Franco Di Santo 
99        Jonathan Calleri 
105       Mauro Dos Santos 
117          Silvio Romero 
119          Emiliano Sala 
125      Maximiliano López 
139       Facundo Ferreyra 
142      Sebastián Driussi 
152            Julio Furch 
156           Lucas Viatri 
157         Guido Carrillo 
164              José Sand 
167       Enrique Triverio 
171     Juan Ignacio Gomez 
172         Leonardo Ulloa 
180            Franco Jara 
185       Giovanni Simeone 
188         Sebastián Leto 
191     Juan Martín Lucero 
192            Mauro Matos 
195       Gastón Fernández 
207           Germán Denis 
214        Milton Caraglio 
215           Diego Churín 
221    Denis Stracqualursi 
239         Mariano Pavone 
242         Ezequiel Ponce 
267        Darío Cvi

In [17]:
c5.mean(numeric_only=True)-dfX.mean()

Crossing              -1.339612
Finishing             21.226822
Heading Accuracy      14.347209
Short Passing          1.022636
Volleys               16.487597
Dribbling              6.661628
Curve                  1.443101
Free Kick Accuracy    -1.360698
Long Passing          -5.248605
Ball Control           6.931783
Acceleration           0.430620
Sprint Speed           0.375581
Agility                1.138140
Reactions              0.209535
Balance               -1.338992
Shot Power            10.760543
Jumping                4.911473
Stamina                0.089457
Strength               6.174496
Long Shots             8.473023
Aggression            -5.868527
Interceptions        -22.488760
Positioning           17.681085
Vision                -0.982713
Penalties             12.270698
Composure              3.528605
Marking              -23.929767
Standing Tackle      -23.577519
Sliding Tackle       -23.449302
GK Diving             -5.517907
GK Handling           -4.406202
GK Kicki

These players have exceptional Volleys and Finishing scores. Based on these and their low scores in Marking and Tackling, it is likely that these players are strikers. The Vision of the players in this cluster is also slightly below average, which differentiates it from the centre forward cluster.

## 6. Prediction of new point

To assign a new player to a cluster based on their attributes, we first retrieve the centroids of the existing 5 clusters.

We then input the attributes of the new player:

| Attribute    | Value |
|--------------|-------|
| Crossing     | 45    |
| Sprint Speed | 40    |
| Long Shots   | 35    |
| Aggression   | 45    |
| Marking      | 60    |
| Finishing    | 40    |
| GK Handling  | 15    |

We find the index of the appropriate columns in our dataframe and find the equivalent attributes of the centroids.

For the new point, we calculate the Euclidean distance from the player to each of the centroids and assign him to the closest cluster, the one with the minimum distance.

In [18]:
centroid = football.cluster_centers_

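# Values follow the column order of dfX:
# Crossing, Finishing, Sprint Speed, Long Shots, Aggression, Marking, GK Handling
newPoint = [45,40,40,35,45,60,15]

var = set(['Crossing','Finishing','Sprint Speed','Long Shots','Aggression','Marking','GK Handling'])

# Index into dfX (not df): the centroids have no 'link' or 'name' column
varLocations = [i for i,x in enumerate(list(dfX)) if x in var]
centroidSub = [x[varLocations] for x in centroid]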

distances = [np.linalg.norm(x-newPoint) for x in centroidSub]
distances.index(min(distances))+1

4

We assign the new point to Cluster 4 (Centre Defense); this player is likely a defender.
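The assignment rule itself can be sketched without pandas or scikit-learn. In the snippet below, the new player's values are the ones given above, but the three centroids are hypothetical numbers chosen only to illustrate the computation; they are not the centroids fitted by the model.

```python
import math

features = ["Crossing", "Finishing", "Sprint Speed", "Long Shots",
            "Aggression", "Marking", "GK Handling"]
new_point = [45, 40, 40, 35, 45, 60, 15]

# HYPOTHETICAL centroids restricted to the seven features above (not model output).
centroids = {
    "attacker-like":   [65, 65, 75, 65, 50, 30, 10],
    "goalkeeper-like": [15, 12, 40, 10, 25, 12, 70],
    "defender-like":   [42, 30, 55, 30, 65, 68, 10],
}

def nearest(point, centroids):
    """Return the key of the centroid with the smallest Euclidean distance."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))

for name, c in centroids.items():
    print(name, round(math.dist(new_point, c), 1))
print("Assigned to:", nearest(new_point, centroids))
```

Because only seven of the 34 attributes are known for the new player, the distance is computed on that subset of each centroid, which is what the notebook cell above does with varLocations.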