# Bonus lesson: Using NearestNeighbors to analyze soccer and city data

In this extracurricular lesson, we'll use the NearestNeighbors algorithm (closely related to KNN) to find soccer players who are similar to one another. Then we'll use it to determine which city Amazon should choose for its second national headquarters.

This is also a bit of a lesson in recommendation systems, because we'll be "recommending" cities and players.

## FIFA Player Data

Dataset information: https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset

In [1]:
#Imports
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
pd.set_option("display.max_columns", 35)

In [3]:
fifa = pd.read_csv("../data/fifa_data.csv")
fifa.head()

Unnamed: 0,Name,Nationality,National_Position,National_Kit,Club,Club_Position,Club_Kit,Club_Joining,Contract_Expiry,Rating,Height,Weight,Preffered_Foot,Birth_Date,Age,Preffered_Position,Work_Rate,...,Strength,Balance,Agility,Jumping,Heading,Shot_Power,Finishing,Long_Shots,Curve,Freekick_Accuracy,Penalties,Volleys,GK_Positioning,GK_Diving,GK_Kicking,GK_Handling,GK_Reflexes
0,Cristiano Ronaldo,Portugal,LS,7.0,Real Madrid,LW,7.0,07/01/2009,2021.0,94,185 cm,80 kg,Right,02/05/1985,32,LW/ST,High / Low,...,80,63,90,95,85,92,93,90,81,76,85,88,14,7,15,11,11
1,Lionel Messi,Argentina,RW,10.0,FC Barcelona,RW,10.0,07/01/2004,2018.0,93,170 cm,72 kg,Left,06/24/1987,29,RW,Medium / Medium,...,59,95,90,68,71,85,95,88,89,90,74,85,14,6,15,11,8
2,Neymar,Brazil,LW,10.0,FC Barcelona,LW,11.0,07/01/2013,2021.0,92,174 cm,68 kg,Right,02/05/1992,25,LW,High / Medium,...,49,82,96,61,62,78,89,77,79,84,81,83,15,9,15,9,11
3,Luis Suárez,Uruguay,LS,9.0,FC Barcelona,ST,9.0,07/11/2014,2021.0,92,182 cm,85 kg,Right,01/24/1987,30,ST,High / Medium,...,76,60,86,69,77,87,94,86,86,84,85,88,33,27,31,25,37
4,Manuel Neuer,Germany,GK,1.0,FC Bayern,GK,1.0,07/01/2011,2021.0,92,193 cm,92 kg,Right,03/27/1986,31,GK,Medium / Medium,...,83,35,52,78,25,25,13,16,14,11,47,11,91,89,95,90,89


Data cleaning time. I'm going to filter the data down to just the skill attributes, and to players who are not goalkeepers.

In [4]:

#Set name as index
fifa.set_index("Name", inplace=True)
#Drop goalkeepers
fifa = fifa[fifa.Club_Position != "GK"].copy()
#List of columns to drop.
cols = ['Nationality', 'National_Position', 'National_Kit', 'Club','Club_Position',
        'Club_Kit', 'Club_Joining', 'Contract_Expiry', 'Height',
        'Weight', 'Preffered_Foot', 'Birth_Date', 'Age',
        'Preffered_Position', 'Work_Rate', "GK_Positioning", "GK_Diving", "GK_Kicking",
       "GK_Handling", "GK_Reflexes"]
#Drop these columns
fifa.drop(cols, axis = 1, inplace=True)
#lower case column names
fifa.columns = fifa.columns.str.lower()

fifa.head()

Name,rating,weak_foot,skill_moves,ball_control,dribbling,marking,sliding_tackle,standing_tackle,aggression,reactions,attacking_position,interceptions,vision,composure,crossing,short_pass,long_pass,acceleration,speed,stamina,strength,balance,agility,jumping,heading,shot_power,finishing,long_shots,curve,freekick_accuracy,penalties,volleys
Cristiano Ronaldo,94,4,5,93,92,22,23,31,63,96,94,29,85,86,84,83,77,91,92,92,80,63,90,95,85,92,93,90,81,76,85,88
Lionel Messi,93,4,4,95,97,13,26,28,48,95,93,22,90,94,77,88,87,92,87,74,59,95,90,68,71,85,95,88,89,90,74,85
Neymar,92,5,5,95,96,21,33,24,56,88,90,36,80,80,75,81,75,93,90,79,49,82,96,61,62,78,89,77,79,84,81,83
Luis Suárez,92,4,4,91,86,30,38,45,78,93,92,41,84,83,77,83,64,88,77,89,76,60,86,69,77,87,94,86,86,84,85,88
Robert Lewandowski,90,4,3,87,85,25,19,42,80,88,89,39,78,87,62,83,65,79,82,79,84,79,78,84,85,86,91,82,77,76,81,86


In [5]:
fifa.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16956 entries, Cristiano Ronaldo to Barry Richardson
Data columns (total 32 columns):
rating                16956 non-null int64
weak_foot             16956 non-null int64
skill_moves           16956 non-null int64
ball_control          16956 non-null int64
dribbling             16956 non-null int64
marking               16956 non-null int64
sliding_tackle        16956 non-null int64
standing_tackle       16956 non-null int64
aggression            16956 non-null int64
reactions             16956 non-null int64
attacking_position    16956 non-null int64
interceptions         16956 non-null int64
vision                16956 non-null int64
composure             16956 non-null int64
crossing              16956 non-null int64
short_pass            16956 non-null int64
long_pass             16956 non-null int64
acceleration          16956 non-null int64
speed                 16956 non-null int64
stamina               16956 non-null int64
strength 

We now have a dataset of 16956 players and 32 attributes. For the purposes of this exercise, I'm going to reduce the size of this dataset by filtering out players whose "rating" value is less than 70.

In [6]:
fifa = fifa[fifa.rating >= 70]
fifa.shape

(5232, 32)

Now we have 5232 players.

In [7]:
#Drop the rating column (reassign rather than drop in place, since fifa is a filtered slice)
fifa = fifa.drop("rating", axis=1)

In [8]:
fifa.head()

Name,weak_foot,skill_moves,ball_control,dribbling,marking,sliding_tackle,standing_tackle,aggression,reactions,attacking_position,interceptions,vision,composure,crossing,short_pass,long_pass,acceleration,speed,stamina,strength,balance,agility,jumping,heading,shot_power,finishing,long_shots,curve,freekick_accuracy,penalties,volleys
Cristiano Ronaldo,4,5,93,92,22,23,31,63,96,94,29,85,86,84,83,77,91,92,92,80,63,90,95,85,92,93,90,81,76,85,88
Lionel Messi,4,4,95,97,13,26,28,48,95,93,22,90,94,77,88,87,92,87,74,59,95,90,68,71,85,95,88,89,90,74,85
Neymar,5,5,95,96,21,33,24,56,88,90,36,80,80,75,81,75,93,90,79,49,82,96,61,62,78,89,77,79,84,81,83
Luis Suárez,4,4,91,86,30,38,45,78,93,92,41,84,83,77,83,64,88,77,89,76,60,86,69,77,87,94,86,86,84,85,88
Robert Lewandowski,4,3,87,85,25,19,42,80,88,89,39,78,87,62,83,65,79,82,79,84,79,78,84,85,86,91,82,77,76,81,86


Scaling time.

In [9]:
#Fit transform data using StandardScaler

ss = StandardScaler()
fifa_ss = ss.fit_transform(fifa)


In [10]:
#Convert to data frame
fifa_ss = pd.DataFrame(fifa_ss, index=fifa.index, columns= fifa.columns)
fifa_ss.head(30)

Name,weak_foot,skill_moves,ball_control,dribbling,marking,sliding_tackle,standing_tackle,aggression,reactions,attacking_position,interceptions,vision,composure,crossing,short_pass,long_pass,acceleration,speed,stamina,strength,balance,agility,jumping,heading,shot_power,finishing,long_shots,curve,freekick_accuracy,penalties,volleys
Cristiano Ronaldo,1.277111,3.097341,1.941744,1.731423,-1.311088,-1.328116,-1.122345,-0.172488,4.178137,1.919274,-1.311986,1.714167,1.875467,1.476555,1.356676,1.161609,1.610164,1.731246,1.692856,0.867674,-0.252098,1.582623,2.306064,1.572406,1.783306,2.08978,1.818682,1.318979,1.307741,1.879504,1.991155
Lionel Messi,1.277111,1.806906,2.103142,2.053962,-1.707442,-1.194838,-1.25828,-1.170192,4.013949,1.863299,-1.652222,2.083276,2.645478,1.042401,1.821085,1.971331,1.685166,1.338757,0.228926,-0.920973,2.092877,1.582623,-0.067305,0.635547,1.289407,2.197985,1.702422,1.782492,2.117156,1.136338,1.819532
Neymar,2.733485,3.097341,2.103142,1.989454,-1.355127,-0.883854,-1.439528,-0.638083,2.864632,1.695373,-0.97175,1.345059,1.297959,0.918357,1.170912,0.999664,1.760168,1.57425,0.635573,-1.77271,1.140231,2.022506,-0.682622,0.03328,0.795507,1.873371,1.062993,1.203101,1.770264,1.609262,1.705117
Luis Suárez,1.277111,1.806906,1.780346,1.344378,-0.958773,-0.661723,-0.487978,0.825216,3.685573,1.807324,-0.728724,1.640346,1.586713,1.042401,1.356676,0.108969,1.385158,0.553777,1.448868,0.52698,-0.471939,1.289368,0.020598,1.037058,1.430521,2.143882,1.586162,1.608675,1.770264,1.879504,1.991155
Robert Lewandowski,1.277111,0.51647,1.45755,1.27987,-1.17897,-1.505821,-0.623914,0.958243,2.864632,1.639398,-0.825935,1.197416,1.971718,0.112071,1.356676,0.189941,0.71014,0.946267,0.635573,1.208369,0.92039,0.702857,1.339136,1.572406,1.359964,1.981576,1.353642,1.087223,1.307741,1.609262,1.87674
Gareth Bale,-0.179263,1.806906,1.538249,1.5379,-0.033947,-0.039756,-0.034859,-0.039461,2.700444,1.471472,0.146168,1.271237,1.779216,1.662621,1.635322,1.404525,1.760168,1.96674,0.554244,0.867674,-0.105537,0.629543,1.427039,1.639324,1.712749,1.765167,1.818682,1.608675,1.828079,1.271459,1.304664
Zlatan Ibrahimović,1.277111,1.806906,1.699647,1.408885,-1.619363,-1.150411,-0.669226,1.224298,2.372067,1.471472,-1.749433,1.566524,2.356724,0.980379,1.449558,1.080636,-0.039881,0.318283,0.310255,1.974932,-1.864268,1.289368,0.284306,1.237813,1.853863,1.927474,1.702422,1.376918,1.654633,2.284867,2.277192
Jérôme Boateng,1.277111,-0.773965,0.247064,0.118733,1.683587,1.692866,1.641682,1.09127,2.207879,-0.711568,1.361297,1.049772,1.875467,0.546225,0.613621,1.404525,0.33513,0.867769,0.310255,1.804585,-0.911622,-0.763419,0.548013,1.639324,0.866064,-1.10225,-0.041475,-0.129499,-1.29395,-0.755355,-0.011109
Eden Hazard,1.277111,1.806906,1.780346,1.795931,-1.17897,-1.372542,-1.303592,-0.77111,2.372067,1.415496,-0.728724,1.787989,1.875467,1.228467,1.449558,1.404525,1.760168,1.338757,0.472914,-0.409931,1.726475,1.729251,-0.858428,-0.301312,0.866064,1.440554,1.353642,1.376918,1.481187,1.947064,1.476287
Luka Modrić,1.277111,1.806906,1.861045,1.344378,0.626643,0.893194,1.097939,-0.239001,2.864632,1.079644,0.972456,2.083276,0.912953,1.104423,2.192613,1.647442,0.560136,0.082789,0.960891,-1.006147,2.019596,1.802564,-0.155207,-0.435149,0.442722,0.899532,1.353642,1.203101,1.365556,1.541701,1.190249


We now have a scaled data frame of the FIFA attributes for 5000+ players. We're going to use this data to find which players are similar to one another.
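To make the scaling step less of a black box, here is a minimal sketch of what `StandardScaler` computes: each column has its mean subtracted and is divided by its population standard deviation. The toy frame and its values are invented for illustration and are not taken from the FIFA file.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy attributes, not taken from the FIFA data
toy = pd.DataFrame(
    {"dribbling": [90, 70, 50], "strength": [60, 80, 85]},
    index=["player_a", "player_b", "player_c"],
)

# StandardScaler: subtract the column mean, divide by the population std (ddof=0)
scaled = StandardScaler().fit_transform(toy)

# Same calculation by hand; pandas' .std() defaults to ddof=1, so ddof=0 is set explicitly
by_hand = (toy - toy.mean()) / toy.std(ddof=0)

# True when both approaches agree
matches = np.allclose(scaled, by_hand.values)
print(matches)
```

After scaling, every column has mean 0 and standard deviation 1, so no single attribute dominates the distance calculation just because of its units.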

In [11]:
#Initialize the NearestNeighbors algorithm with 6 neighbors and metric = euclidean
nn = NearestNeighbors(n_neighbors=6, metric="euclidean")

#Fit nn on the fifa_ss data
nn.fit(fifa_ss)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='euclidean',
         metric_params=None, n_jobs=1, n_neighbors=6, p=2, radius=1.0)

With metric="euclidean", NearestNeighbors uses the Euclidean distance formula to calculate distances between the player vectors. The shorter the distance between two players, the more similar they are.
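As a quick illustration of the distance itself, the sketch below computes the Euclidean distance between two made-up player vectors by hand: take the difference per attribute, square it, sum, and take the square root. The vectors are hypothetical and have only three features to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical scaled attribute vectors for two players (three features each)
player_a = np.array([1.0, -0.5, 2.0])
player_b = np.array([0.0, 1.5, 0.0])

# Euclidean distance: square root of the summed squared differences
diff = player_a - player_b
distance = np.sqrt(np.sum(diff ** 2))

# np.linalg.norm gives the same number in one call
same_distance = np.linalg.norm(diff)

print(distance)  # 3.0
```

Here the differences are 1, -2 and 2, so the squared differences sum to 9 and the distance is 3.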

In [12]:
#Assign data from players to variables

#Pogba is a midfielder
pogba = fifa_ss.loc["Paul Pogba"].values.reshape(1, -1)
#Neymar is a forward
neymar = fifa_ss.loc["Neymar"].values.reshape(1, -1)
#Chiellini is a defender
chiellini = fifa_ss.loc["Giorgio Chiellini"].values.reshape(1, -1)

Use the kneighbors method to find the neighbors for each player.

In [13]:
#Pass player variables in to nn.kneighbors

pogba_neighbors = nn.kneighbors(pogba)
neymar_neighbors = nn.kneighbors(neymar)
chiellini_neighbors = nn.kneighbors(chiellini)

In [14]:
#We can also pass the whole dataset into the nn object
everyones_neighbors = nn.kneighbors(fifa_ss)

In [15]:
#Look at pogba_neighbors
pogba_neighbors

(array([[ 0.        ,  3.35874646,  3.69872742,  3.7555846 ,  3.8697611 ,
          3.89096401]]), array([[ 15,  49, 227, 446, 102,   5]]))

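This outputs a tuple of two NumPy arrays: the distances to the neighbors, and the row positions of those neighbors in the fitted data.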
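Because `kneighbors` returns a tuple, the two arrays can be unpacked directly instead of being indexed with `[0]` and `[1]`. The sketch below shows this on a handful of invented 2-D points; `points` and `toy_nn` are hypothetical names used only for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D points standing in for scaled player vectors
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

toy_nn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(points)

# kneighbors returns (distances, indices); each array has one row per query point
distances, indices = toy_nn.kneighbors(points[[0]])

print(distances.shape, indices.shape)  # (1, 3) (1, 3)
print(indices[0])                      # row positions, closest first
```

The query point is row 0 of the fitted data, so it comes back as its own nearest neighbor at distance 0, followed by rows 1 and 2.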

In [16]:
#Array number one
pogba_neighbors[0]

array([[ 0.        ,  3.35874646,  3.69872742,  3.7555846 ,  3.8697611 ,
         3.89096401]])

These are the distances between Paul Pogba and his six closest neighbors. You'll notice that the first distance is 0. That's because Paul Pogba is already in the data, so naturally the closest player to Paul Pogba is Paul Pogba.

In [17]:
#Array number two
pogba_neighbors[1]

array([[ 15,  49, 227, 446, 102,   5]])

These are the index values of the 6 closest neighbors to Paul Pogba.

Let's use these index values to see the closest players to Pogba.

In [18]:
#Create variable for index values of the closest players to pogba
pogba_neighbors_index = pogba_neighbors[1][0]

In [19]:
#Pass pogba_neighbors_index into the fifa_ss dataframe and call .index to return only the names
fifa_ss.iloc[pogba_neighbors_index].index

Index([u'Paul Pogba', u'Claudio Marchisio', u'Axel Witsel',
       u'Kevin-Prince Boateng', u'Arda Turan', u'Gareth Bale'],
      dtype='object', name=u'Name')

These are the 5 closest players to Paul Pogba (ignoring Paul Pogba of course).
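The lookup steps used here (grab a row, call `kneighbors`, map the positions back to names) can be wrapped in a small helper. The sketch below is one possible way to do that, not the notebook's own code: `similar_rows` and the toy frame are hypothetical, and it assumes the index labels are unique.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def similar_rows(frame, model, name, k=5):
    """Return the k rows nearest to `name` as a Series of distances, excluding `name` itself."""
    query = frame.loc[[name]]  # double brackets keep a one-row, 2-D DataFrame
    # Ask for one extra neighbor because the query row matches itself at distance 0
    distances, indices = model.kneighbors(query, n_neighbors=k + 1)
    result = pd.Series(distances[0], index=frame.index[indices[0]])
    return result.drop(name, errors="ignore").head(k)

# Hypothetical toy data standing in for the scaled player frame
toy = pd.DataFrame(
    {"x": [0.0, 1.0, 0.0, 5.0], "y": [0.0, 0.0, 2.0, 5.0]},
    index=["a", "b", "c", "d"],
)
toy_nn = NearestNeighbors(metric="euclidean").fit(toy)

closest = similar_rows(toy, toy_nn, "a", k=2)
print(closest)
```

Returning the distances alongside the names makes it easier to judge how close each match really is, rather than only seeing the ranking.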

Let's do this again for Neymar and Chiellini.

In [20]:
#Neymar
neymar_neighbors_index = neymar_neighbors[1][0]
fifa_ss.iloc[neymar_neighbors_index,].index

Index([u'Neymar', u'Eden Hazard', u'Ousmane Dembélé', u'Franck Ribéry',
       u'Giovani dos Santos', u'Pedro'],
      dtype='object', name=u'Name')

In [21]:
chiellini_neighbors_index = chiellini_neighbors[1][0]
fifa_ss.iloc[chiellini_neighbors_index,].index

Index([u'Giorgio Chiellini', u'Sokratis', u'Marvin Matip', u'Miranda',
       u'Marcos Angeleri', u'Andrea Barzagli'],
      dtype='object', name=u'Name')

## Amazon

My article about how I used this algorithm to determine which city Amazon should choose as the home of its second HQ: https://www.thinkful.com/blog/data-science-just-figured-out-where-amazons-hq2-should-be/

In [22]:
#Load in data, which is a CSV file

cities = pd.read_csv("../data/amazon_cities_data.csv", index_col="cities")
cities.head()

cities,sprawl,diversity,business_score,fiber_coverage,excellent_education,percent_bachelors,life_quality,mobile_network_score,transit_scores
Atlanta,41.0,67.26,4.0,11.2,2,35.8,6.6,97.2,7.7
Austin,102.44,69.91,3.7,12.1,1,41.7,7.8,97.2,5.47
Baltimore,115.62,64.92,1.7,60.2,2,37.3,6.3,96.6,8.52
Boston,126.93,68.96,3.0,38.8,7,44.6,7.1,95.3,9.44
Charlotte,70.45,69.56,3.3,11.3,0,32.2,7.1,95.8,4.33


This data definitely needs scaling: in the rows above, sprawl runs past 100 while business_score stays below 5.

In [23]:
#Initialize scaler
ss = StandardScaler()

#Fit and transform data using scaler
cities_ss = ss.fit_transform(cities)

cities_ss = pd.DataFrame(cities_ss, index=cities.index, columns=cities.columns)
cities_ss.head()

cities,sprawl,diversity,business_score,fiber_coverage,excellent_education,percent_bachelors,life_quality,mobile_network_score,transit_scores
Atlanta,-1.92278,-0.068233,1.198111,-0.627189,0.39104,0.304105,-0.079936,0.687416,0.347833
Austin,-0.062954,0.908403,0.890527,-0.564271,-0.260694,1.330209,2.318146,0.687416,-0.976476
Baltimore,0.336012,-0.930621,-1.160029,2.798383,0.39104,0.564979,-0.679457,-0.00661,0.834798
Boston,0.678373,0.558288,0.172833,1.302317,3.649711,1.834566,0.919265,-1.510333,1.38115
Charlotte,-1.031311,0.779413,0.480416,-0.620198,-0.912428,-0.321993,0.919265,-0.931978,-1.653477


First we need to determine the best city. I did this by creating a Series with the max value of each column, which describes a hypothetical city that tops every category.

In [24]:
best_city = cities_ss.max()
best_city

sprawl                  2.991955
diversity               1.579149
business_score          1.505694
fiber_coverage          2.798383
excellent_education     3.649711
percent_bachelors       2.582405
life_quality            2.318146
mobile_network_score    1.497114
transit_scores          1.476167
dtype: float64

This technique assumes that more equals better. If we had a feature such as cancer rate, that would present an issue here, because the lower the rate, the better the value.
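One possible way to handle a lower-is-better feature is to take the column minimum instead of the maximum when building the ideal vector. The sketch below shows the idea on invented numbers; the `cancer_rate` column and the city labels are hypothetical and do not exist in the cities data.

```python
import pandas as pd

# Hypothetical scaled features; "cancer_rate" is an invented lower-is-better column
scaled = pd.DataFrame(
    {"life_quality": [-1.0, 0.0, 1.0], "cancer_rate": [1.2, -0.2, -1.0]},
    index=["city_a", "city_b", "city_c"],
)

lower_is_better = ["cancer_rate"]

# Start from the column maxima, then swap in the minima where lower is better
ideal = scaled.max()
for col in lower_is_better:
    ideal[col] = scaled[col].min()

print(ideal)
```

The resulting `ideal` Series can then be passed to `kneighbors` in the same way as a vector built only from maxima.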

Time to fit the data on a NearestNeighbors object.

In [25]:
#Initialize algorithm with 5 neighbors
nn = NearestNeighbors(n_neighbors=5, metric="euclidean")
#Fit data
nn.fit(cities_ss)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='euclidean',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

In [26]:
#Pass in best city to see the top five closest cities

best_city_neighbors_index = nn.kneighbors(best_city.values.reshape(1, -1))[1][0]

In [27]:
#Pass the best_city_neighbors_index object into cities_ss
cities_ss.iloc[best_city_neighbors_index].index

Index([u'Washington DC', u'Boston', u'New York', u'Philadelphia', u'Austin'], dtype='object', name=u'cities')

Washington DC is the winner of the 2017 Amazon HQ sweepstakes!!!

Boston comes in 2nd, and New York in 3rd.