# Machine Learning
## Group 4
## Assignment 4
## K-means clustering


## 2. Web scraping

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
import unicodedata

from sklearn.cluster import KMeans
import numpy as np

In [34]:
attributes = ['Crossing', 'Finishing', 'Heading Accuracy', 'Short Passing', 'Volleys', 'Dribbling', 'Curve', 'Free Kick Accuracy',
              'Long Passing','Ball Control', 'Acceleration', 'Sprint Speed', 'Agility', 'Reactions', 'Balance', 'Shot Power','Jumping',
              'Stamina', 'Strength','Long Shots','Aggression','Interceptions','Positioning','Vision','Penalties','Composure','Marking',
              'Standing Tackle','Sliding Tackle','GK Diving','GK Handling', 'GK Kicking','GK Positioning','GK Reflexes']

              
links = []
for offset in ['0','100','200']:
    page = requests.get('http://sofifa.com/players?na=52&offset=' + offset)
    soup = BeautifulSoup(page.content,'html.parser')
    for link in soup.find_all('a'):
        links.append(link.get('href'))

# Keep only links to player pages; <a> tags without an href yield None and are skipped
links = ['http://sofifa.com' + l for l in links if l and 'player/' in l]

pattern = r"""\s*([\w\s]*)"""

for attr in attributes:
    pattern += r""".*?(\d*\s*"""+attr+r""")""" #Attribute variables
    
pat = re.compile(pattern, re.DOTALL)

rows = []
for link in links:
    row = [link]
    playerpage = requests.get(link)
    playersoup = BeautifulSoup(playerpage.content, 'html.parser')
    text = playersoup.get_text()
    a = pat.match(text)
    if a is None:
        continue  # Skip pages whose text does not fit the pattern
    row.append(a.group(1))  # Group 1 is the player's name
    for i in range(2,len(attributes)+2):
        # The remaining groups look like "<number> <attribute>"; keep the number
        row.append(int(a.group(i).split()[0]))
    rows.append(row)
df = pd.DataFrame(rows,columns =['link','name'] + attributes)
df.to_csv('ArgentinaPlayers.csv',index=False)

The code above scrapes the FIFA attributes for the top 300 Argentinian players on the sofifa.com website.

The code starts by defining each of the 34 attributes that characterize each player in the FIFA game. These attributes range from Passing and Dribbling to Tackling and Strength. Each player in the game is represented by a combination of the above attributes.

We then collect the links to the 300 top-rated players for a specific country. The website lists the players in groups of 100. For each group of 100 players, we request the full HTML content of the page. Links on the page are marked up with the HTML "a" (anchor) tag, and each tag's href attribute holds the link target. We append the href of every "a" tag to the links list. For player links, the href contains the player's unique ID. For example, Lionel Messi's player ID is 158023. We keep only the hrefs that contain the string "player/". These correspond to each player's personal page, which lists their attributes.
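The link filtering described above can be tried offline on a small HTML fragment. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so that it runs without extra packages, and the `LinkCollector` class and the sample HTML (including the player IDs and names) are made up for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip anchors without an href
                self.hrefs.append(href)

# Made-up listing-page fragment; the IDs and names are placeholders.
sample_html = """
<a href="/player/1/example-one">Example One</a>
<a href="/teams">Teams</a>
<a name="top">Anchor without href</a>
<a href="/player/2/example-two">Example Two</a>
"""

collector = LinkCollector()
collector.feed(sample_html)

# Same filter and URL construction as in the notebook.
player_links = ['http://sofifa.com' + h for h in collector.hrefs if 'player/' in h]
print(player_links)
```

Only the two hrefs containing "player/" survive the filter. The navigation link and the anchor without an href are dropped.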

For each href that points to a player, we build the URL of that player's page by prepending http://sofifa.com. By accessing this URL, we can scrape the player's game attributes. In the page text, every attribute appears in the same form, a number followed by the attribute name, so a regular expression (regex) can capture them. The first part of the regex captures the player's name, and each subsequent part captures one attribute value together with its name, in the order given by the attributes list.

We iterate through each of the player links we created earlier. For each link, we parse the returned HTML with BeautifulSoup. Using the get_text() function, we extract only the text of the web page. We then match the regex defined earlier against this text. The first group of the match holds the player's name, and the remaining 34 groups hold the attributes. We convert each attribute value to an integer and append it to the player's row.
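To make the pattern construction and group extraction concrete, here is a minimal sketch on a made-up piece of page text. The three-attribute list, the sample text and the player name are assumptions for illustration. The use of `re.escape` is also an addition that the notebook does not need, since its attribute names contain no special regex characters.

```python
import re

# Hypothetical three-attribute subset; the notebook uses all 34 attributes.
sample_attributes = ['Crossing', 'Finishing', 'Heading Accuracy']

# Made-up text standing in for the output of get_text() on a player page.
sample_text = (
    "\nExample Player (ID: 0)\n"
    "Some unrelated text\n"
    "85 Crossing\n"
    "90 Finishing\n"
    "70 Heading Accuracy\n"
)

# Group 1 captures the leading name (word characters and whitespace only).
pattern = r"\s*([\w\s]*)"
for attr in sample_attributes:
    # Each further group captures "<number> <attribute>", skipping anything before it.
    pattern += r".*?(\d*\s*" + re.escape(attr) + r")"

match = re.compile(pattern, re.DOTALL).match(sample_text)

name = match.group(1).strip()
values = [int(match.group(i).split()[0])
          for i in range(2, len(sample_attributes) + 2)]
print(name, values)
```

The name group stops at the first character that is neither a word character nor whitespace (here the opening parenthesis), which is why the captured name can carry a trailing space before it is stripped.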

Because many of our players have non-English names, their names may contain characters that are not ASCII. These characters belong to the larger Unicode character set and must be preserved when extracting the player's name. Fortunately, BeautifulSoup with the html.parser parser decodes the page into Unicode text, and get_text() returns Unicode strings. Therefore, the extracted player names are already in the proper Unicode format and need no further changes.
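The notebook imports `unicodedata` but does not end up needing it. Should an ASCII-only version of the names ever be required, one possible approach is sketched below. The `to_ascii` helper is hypothetical and not part of the original code.

```python
import unicodedata

def to_ascii(name):
    """Return `name` with accents removed (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', name)
    # Drop the combining marks and keep the base characters.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ['Sergio Agüero', 'Ángel Di María', 'Éver Banega']
ascii_names = [to_ascii(n) for n in names]
print(ascii_names)  # ['Sergio Aguero', 'Angel Di Maria', 'Ever Banega']
```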

At this point, each player's attributes are stored as a list inside a list that holds all of the player data. We convert this list into a pandas dataframe with one row per player. The columns are the player's URL, the player's name, and then all of the attributes defined at the start. We use this dataframe to cluster our players, since scikit-learn's KMeans accepts a pandas dataframe of numeric columns as input. Finally, we write the resulting data to a csv file.

## 3. 500 English players

To download the first 500 English players instead of the first 300 Argentinian players, we need to modify the code in two places. 

First we need to modify the code to parse players from a different country. As we noticed above, Argentina's country code on the sofifa.com website is na=52. We change this value to na=14, which corresponds to England's players. 

We also need to modify the number of players parsed, from 300 to 500. This is done by modifying the first for loop in the code. We change the list of offsets from ['0','100','200'] to ['0','100','200','300','400']. This adds 2 more pages of 100 players for the code to search through. With these two modifications, the code returns the data for the first 500 English players.

The new code is shown below.


In [35]:
## NEW CODE IS THIS FOR LOOP!
# for offset in ['0','100','200','300','400']:
#    page = requests.get('http://sofifa.com/players?na=14&offset=' + offset)
#    soup = BeautifulSoup(page.content,'html.parser')
#    for link in soup.find_all('a'):
#        links.append(link.get('href'))
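The two changes (country code and number of pages) can also be expressed as parameters instead of edits to the loop. The following is a sketch under assumptions: `page_urls` is a hypothetical helper, the URL format is copied from the code above, and `page_size=100` assumes that each listing page holds 100 players.

```python
def page_urls(country_code, n_players, page_size=100):
    """Build listing-page URLs covering the first `n_players` players.

    Hypothetical helper; assumes the site pages players in blocks of `page_size`.
    """
    base = 'http://sofifa.com/players?na={}&offset={}'
    return [base.format(country_code, offset)
            for offset in range(0, n_players, page_size)]

argentina_urls = page_urls(country_code=52, n_players=300)
england_urls = page_urls(country_code=14, n_players=500)

for url in england_urls:
    print(url)
```

For England this produces five URLs with offsets 0, 100, 200, 300 and 400, matching the modified loop.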


## 4. Training the model

In [36]:
#Drop link and name from variables to train on
dfX = df.drop(['link','name'],axis=1)

#Train the model
football = KMeans(n_clusters=5,random_state=12345).fit(dfX)

## 5. Labelling the clusters

In [37]:
labels = football.labels_

#Extracting players into their respective clusters
# .ix has been removed from pandas; .loc accepts the same boolean mask
c1 = df.loc[labels==0]
c2 = df.loc[labels==1]
c3 = df.loc[labels==2]
c4 = df.loc[labels==3]
c5 = df.loc[labels==4]

### Cluster 1

In [38]:
print(c1['name'])

0                 Lionel Messi 
2                Sergio Agüero 
3               Ángel Di María 
4                 Paulo Dybala 
5               Nicolás Gaitán 
10              Javier Pastore 
20                 Erik Lamela 
21             Alejandro Gómez 
23               Diego Perotti 
26          Fernando Belluschi 
28              Eduardo Salvio 
31              Manuel Lanzini 
33              Luciano Vietto 
34               Pablo Batalla 
35              Lisandro López 
38             Rodrigo Palacio 
39                Ángel Correa 
40              Ignacio Piatti 
42                   José Sosa 
50                Diego Valeri 
51                Pablo Piatti 
56            Diego Buonanotte 
58          Rogelio Funes Mori 
60             Pablo De Blasis 
70                Franco Cervi 
71                Mauro Zárate 
73              Lautaro Acosta 
74            Sebastián Blanco 
79                 Ismael Sosa 
80       Alexander Szymanowski 
                 ...           
194     

### Cluster 2

In [39]:
print(c2['name'])

11           Gerónimo Rulli 
32            Sergio Romero 
37         Marcelo Barovero 
43            Nahuel Guzmán 
44          Willy Caballero 
53        Sebastián Torrico 
54        Agustín Marchesín 
64          Mariano Andújar 
66            Franco Armani 
83            Agustín Orión 
105         Mariano Barbosa 
108        Fernando Monetti 
123              Germán Lux 
126         Albano Bizzarri 
147      Juan Pablo Carrizo 
162    Cristian Campestrini 
174          Guillermo Sara 
183        Luciano Pocrnjic 
192             Rodrigo Rey 
203             Marcos Díaz 
215           Javier García 
229             Jorge Broun 
235            Oscar Ustari 
243          Julián Speroni 
253            Luis Ardente 
290         Nereo Fernández 
Name: name, dtype: object


### Cluster 3

In [40]:
print(c3['name'])

7                Éver Banega 
8          Javier Mascherano 
12               Marcos Rojo 
13           Mateo Musacchio 
15            Ezequiel Garay 
16            Pablo Zabaleta 
17              Lucas Biglia 
18         Augusto Fernández 
19           Roberto Pereyra 
22             Claudio Yacob 
27          Cristian Ansaldi 
36            Nicolás Pareja 
47             Guido Pizarro 
48              Marcos Acuña 
49                Enzo Pérez 
57             David Abraham 
65               Pablo Pérez 
67              Lucas Castro 
69           Leandro Paredes 
75         Esteban Cambiasso 
76              Gino Peruzzi 
77             Fernando Gago 
81         Facundo Roncaglia 
82              Emmanuel Mas 
84        Matías Kranevitter 
87               Óscar Trejo 
89         Ramiro Funes Mori 
90         Gonzalo Escalante 
96           Leonardo Ponzio 
97        Nicolás Tagliafico 
                ...          
214             Juan Mercier 
216              Jorge Ortiz 
221       

### Cluster 4

In [41]:
print(c4['name'])

6           Nicolás Otamendi 
14         Gonzalo Rodríguez 
24            Federico Fazio 
25            Gustavo Cabral 
29        Federico Fernández 
30            Lisandro López 
41             Víctor Cuesta 
46         Martín Demichelis 
52       Santiago Gentiletti 
63          Nicolás Burdisso 
68           Germán Pezzella 
72          Mauro Dos Santos 
88           Jonatan Maidana 
93          Santiago Vergini 
106           Matías Caruzzo 
109            Luciano Lollo 
111         Martín Mantovani 
113    Julio Alberto Barroso 
120         Jonathan Schunke 
125           Nicolás Spolli 
135          Marcos Angeleri 
137           Renato Civelli 
143         Juan Insaurralde 
146        Carlos Izquierdoz 
150          Matías Zaldivia 
157       José María Basanta 
164         Leandro Desábato 
173              Juan Forlin 
176           Fernando Tobio 
180       Guillermo Burdisso 
185              Daniel Díaz 
193             Germán Conti 
200           Ezequiel Muñoz 
205       

### Cluster 5

In [42]:
print(c5['name'])

1          Gonzalo Higuaín 
9             Mauro Icardi 
45            Lucas Alario 
55             Marco Ruben 
59          Nicolás Blandi 
61         Darío Benedetto 
62             Gustavo Bou 
78           Mauro Boselli 
95         Franco Di Santo 
100       Jonathan Calleri 
116          Silvio Romero 
118          Emiliano Sala 
124      Maximiliano López 
138       Facundo Ferreyra 
141      Sebastián Driussi 
151            Julio Furch 
155           Lucas Viatri 
156         Guido Carrillo 
163              José Sand 
165           Germán Denis 
167       Enrique Triverio 
170     Juan Ignacio Gomez 
171         Leonardo Ulloa 
179            Franco Jara 
184       Giovanni Simeone 
187         Sebastián Leto 
190     Juan Martín Lucero 
191            Mauro Matos 
211        Milton Caraglio 
212           Diego Churín 
219    Denis Stracqualursi 
236         Mariano Pavone 
239         Ezequiel Ponce 
265        Darío Cvitanich 
267          Claudio Riaño 
268          Ismael 

## 6. Prediction of new point

To assign a new player to a cluster based on his attributes, we first retrieve the centroids of the existing 5 clusters.

We then input the attributes of the new player:

| Attribute    | Value |
|--------------|-------|
| Crossing     | 45    |
| Sprint Speed | 40    |
| Long Shots   | 35    |
| Aggression   | 45    |
| Marking      | 60    |
| Finishing    | 40    |
| GK Handling  | 15    |

We find the positions of the corresponding columns and extract the same attributes from each centroid.

For our new point, we calculate the Euclidean distance from the player to each of the centroids and assign him to the closest cluster, the one with the minimum distance.

In [43]:
centroid = football.cluster_centers_

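# Values in dataframe column order:
# Crossing, Finishing, Sprint Speed, Long Shots, Aggression, Marking, GK Handling
newPoint = [45,40,40,35,45,60,15]

var = set(['Crossing','Finishing','Sprint Speed','Long Shots','Aggression','Marking','GK Handling'])

# Take positions from dfX, not df: the centroids have no 'link' or 'name'
# columns, so indices must refer to the training columns
varLocations = [i for i,x in enumerate(list(dfX)) if x in var]
centroidSub = [x[varLocations] for x in centroid]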

distances = [np.linalg.norm(x-newPoint) for x in centroidSub]
distances.index(min(distances))+1

1

We group the new point into Cluster 1.
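The distances are computed by hand because KMeans's predict method expects a value for every feature the model was trained on, while only some attributes of the new player are known. The sketch below illustrates the same nearest-centroid idea on made-up numbers. The centroids, the four-column layout and the `nearest_centroid` helper are assumptions for illustration and are not values from the fitted model.

```python
import math

# Made-up centroids over four hypothetical feature columns.
columns = ['Crossing', 'Finishing', 'Marking', 'GK Handling']
centroids = [
    [70.0, 65.0, 30.0, 10.0],   # attacking profile
    [15.0, 12.0, 14.0, 70.0],   # goalkeeper profile
    [45.0, 35.0, 68.0, 11.0],   # defensive profile
]

# Known attributes of the new player; unknown columns are simply left out.
new_player = {'Crossing': 45, 'Finishing': 40, 'Marking': 60}

def nearest_centroid(player, centroids, columns):
    """Return (index of closest centroid, list of distances) on the known columns."""
    # Look up positions by name in the centroid's own column list,
    # so the indices always line up with the centroid coordinates.
    idx = [columns.index(name) for name in player]
    point = [player[name] for name in player]
    distances = [math.dist([c[i] for i in idx], point) for c in centroids]
    return distances.index(min(distances)), distances

cluster, distances = nearest_centroid(new_player, centroids, columns)
print(cluster, [round(d, 2) for d in distances])
```

Here the third centroid (index 2) is closest. As in the notebook, adding 1 to the index gives the cluster number used in the headings above.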