# Clustering Lab

 
Based on the amazing work you did in the movie industry, you've been recruited to the NBA! You are working as the VP of Analytics, supporting a head scout, Mr. Rooney, for the worst team in the NBA (probably the Wizards). Mr. Rooney just heard about data science and thinks it can solve all the team's problems! He wants you to find players who are high performing but not highly paid, whom you can steal to get the team to the playoffs!

In this document you will work through a process similar to the one we followed in class, using the NBA data (NBA_Perf_22 and nba_salaries_22) and merging the two files together.

Details: 

- Determine a way to use clustering to estimate, based on performance, whether players are generally underpaid or overpaid.

- Then select players you believe would be best for your team and explain why. Do so in three categories: 
    * Examples that are not good choices (3 or 4) 
    * Several options that are good choices (3 or 4)
    * Several options that could work, assuming you can't get the players in the good category (3 or 4)

- You will decide the cutoffs for each category, so you should be able to explain why you chose them.

- Provide a well-commented, clean report of your findings in a separate notebook that can be presented to Mr. Rooney, keeping in mind he doesn't understand...anything. Include a rationale for the variables you included in the model, details on your approach, and an overview of the results with supporting visualizations.


Hints:

- Salary is the variable you are trying to understand 
- When interpreting the results, you might want to use graphs that include the variables most correlated with Salary
- You'll need to scale the variables before performing the clustering
- Be specific about why you selected the players that you did, more detail is better
- Use good coding practices: comment heavily, indent, avoid for loops unless they are truly necessary, and create modular sections that each align with some outcome. If necessary, create more than one script. List and load libraries at the top, and don't include libraries that aren't used.
- Be careful with non-traditional characters in the players' names; certain graphs won't work when these characters are included.
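
The last hint can be handled in several ways. The block below is only a minimal sketch of one of them, using the standard library; the helper name `to_ascii` and the example names are illustrative assumptions, not part of the assignment data.

```python
import unicodedata


def to_ascii(name: str) -> str:
    """Strip accents from a player name (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding to ASCII with errors="ignore" then drops the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Nikola Jokić"))
print(to_ascii("Luka Dončić"))
```

If this approach suits the data, it could be applied to both tables before merging, for example with `salaries['Player'].map(to_ascii)`. Note that letters without a decomposition (such as "ø") are dropped entirely rather than replaced, so the result is worth checking by eye.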


In [118]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

In [119]:
#Load data

salaries = pd.read_excel("C:\\Users\\sarah\\OneDrive\\Documents\\DS3001_ML\\nba_salaries_21.xlsx")
salaries.info()
#no salaries for 2 players 

salaries = salaries.dropna()
salaries.rename(columns={'2020-21': 'Salary'}, inplace=True)
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Player   489 non-null    object 
 1   2020-21  487 non-null    float64
dtypes: float64(1), object(1)
memory usage: 7.8+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 487 entries, 0 to 488
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  487 non-null    object 
 1   Salary  487 non-null    float64
dtypes: float64(1), object(1)
memory usage: 11.4+ KB


In [120]:
#Load data 

performance = pd.read_csv("C:\\Users\\sarah\\OneDrive\\Documents\\DS3001_ML\\nba2020-21-1.csv", encoding='latin')
performance = performance.dropna()


performance.info()
# G number of games 
# GS games started 
# MP minutes played 
# FG field goals 
# FGA field goals attempted 
# FG% field goal accuracy 
# 3P three pointers 
# 3PA three pointers attempted 
# 3P% three pointers accuracy 
# 2P two pointers 
# 2PA two pointers attempted 
# 2P% two pointers accuracy 
# eFG% effective field goal percentage
# FT free throws 
# FTA free throws attempted 
# FT% free throws accuracy 
# ORB offensive rebounds 
# DRB defensive rebounds 
# TRB total rebounds 
# AST assists 
# STL steals 
# BLK blocks 
# TOV turnovers; want to be low 
# PF personal foul; want to be low 
# PTS points 

<class 'pandas.core.frame.DataFrame'>
Index: 451 entries, 0 to 511
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  451 non-null    object 
 1   Pos     451 non-null    object 
 2   Age     451 non-null    int64  
 3   Tm      451 non-null    object 
 4   G       451 non-null    int64  
 5   GS      451 non-null    int64  
 6   MP      451 non-null    int64  
 7   FG      451 non-null    int64  
 8   FGA     451 non-null    int64  
 9   FG%     451 non-null    float64
 10  3P      451 non-null    int64  
 11  3PA     451 non-null    int64  
 12  3P%     451 non-null    float64
 13  2P      451 non-null    int64  
 14  2PA     451 non-null    int64  
 15  2P%     451 non-null    float64
 16  eFG%    451 non-null    float64
 17  FT      451 non-null    int64  
 18  FTA     451 non-null    int64  
 19  FT%     451 non-null    float64
 20  ORB     451 non-null    int64  
 21  DRB     451 non-null    int64  
 22  TRB    

In [121]:
#combine salaries and performance data sets
#salaries.shape 
#performance.shape 

nba_data = pd.merge(salaries, performance, on='Player')

#drop duplicates 
nba_data = nba_data.drop_duplicates()
nba_data.head(25)

#drop na check 
nba_data = nba_data.dropna()
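
The inner merge above silently drops any player whose name does not match exactly in both tables. A minimal sketch of how the match could be checked is shown below. The two toy frames and their values are made up for illustration; only `pd.merge` with `how="outer"` and `indicator=True` is assumed from pandas.

```python
import pandas as pd

# Toy stand-ins for the salaries and performance tables (made-up values).
salaries_toy = pd.DataFrame({"Player": ["A", "B", "C"],
                             "Salary": [1.0, 2.0, 3.0]})
performance_toy = pd.DataFrame({"Player": ["A", "B", "B", "D"],
                                "PTS": [10, 5, 7, 3]})

# how="outer" keeps unmatched rows; indicator=True adds a "_merge" column
# saying whether each row came from the left table, the right table, or both.
merged = pd.merge(salaries_toy, performance_toy,
                  on="Player", how="outer", indicator=True)

# Rows that would be lost by an inner merge.
unmatched = merged[merged["_merge"] != "both"]
counts = merged["_merge"].value_counts()

# Players who appear more than once in the performance table.
repeated = performance_toy["Player"].value_counts()
repeated = repeated[repeated > 1]

print(counts)
print(unmatched[["Player", "_merge"]])
print(repeated)
```

Note that `drop_duplicates()` only removes rows that are identical in every column, so a player with two different stat lines stays in the data twice.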

In [122]:
#drop the non-numeric columns (position and team); only playing stats are needed to identify high-performing players 

to_drop = ['Pos', 'Tm']
nba_data = nba_data.drop(columns=to_drop)

nba_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 398 entries, 0 to 406
Data columns (total 28 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  398 non-null    object 
 1   Salary  398 non-null    float64
 2   Age     398 non-null    int64  
 3   G       398 non-null    int64  
 4   GS      398 non-null    int64  
 5   MP      398 non-null    int64  
 6   FG      398 non-null    int64  
 7   FGA     398 non-null    int64  
 8   FG%     398 non-null    float64
 9   3P      398 non-null    int64  
 10  3PA     398 non-null    int64  
 11  3P%     398 non-null    float64
 12  2P      398 non-null    int64  
 13  2PA     398 non-null    int64  
 14  2P%     398 non-null    float64
 15  eFG%    398 non-null    float64
 16  FT      398 non-null    int64  
 17  FTA     398 non-null    int64  
 18  FT%     398 non-null    float64
 19  ORB     398 non-null    int64  
 20  DRB     398 non-null    int64  
 21  TRB     398 non-null    int64  
 22  AST    

In [123]:
#determine which variables correlate most with salary (no normalization needed: Pearson correlation is unaffected by rescaling) 

#use numeric only df to determine correlations 

numeric_nba = nba_data.drop('Player', axis = 1)

target_variable = 'Salary'

correlations = numeric_nba.corr()[target_variable]
most_correlated_variables = correlations.abs().sort_values(ascending=False)

most_correlated_variables

#most correlated variables are AST, PTS, TOV, FG, FT, FGA, FTA, 2PA, 2P, GS 

Salary    1.000000
AST       0.601389
PTS       0.592085
TOV       0.586419
FG        0.578326
FT        0.572090
FGA       0.570798
FTA       0.562054
2PA       0.545423
2P        0.533897
GS        0.532541
MP        0.452701
DRB       0.448950
STL       0.435021
3P        0.419560
3PA       0.417276
Age       0.400019
TRB       0.396658
PF        0.264027
BLK       0.211078
FT%       0.176538
ORB       0.171431
G         0.132392
FG%       0.105233
eFG%      0.103485
3P%       0.102641
2P%       0.037662
Name: Salary, dtype: float64

In [124]:
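# normalized looks worse: min-max scaling itself cannot change a correlation, 
# but scaled_df below gets a fresh 0..n-1 index while numeric_nba keeps its original index (with gaps), 
# so assigning it back misaligns rows and introduces NaN values 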

#numeric_nba = nba_data.drop('Player', axis = 1)

#normalize the numeric variables
#numeric_cols = numeric_nba.select_dtypes(include='int64').columns

#normalize numeric variables
#from sklearn import preprocessing
#scaler = preprocessing.MinMaxScaler()
#d = scaler.fit_transform(numeric_nba[numeric_cols]) 
#scaled_df = pd.DataFrame(d, columns=numeric_cols) 

#numeric_nba[numeric_cols] = scaled_df 

#target_variable = 'Salary'

#correlations = numeric_nba.corr()[target_variable]
#most_correlated_variables = correlations.abs().sort_values(ascending=False)

#most_correlated_variables
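
The point about scaling and correlation can be checked on a small example. The sketch below uses made-up random numbers and a hypothetical `min_max` helper; it shows that min-max scaling leaves the Pearson correlation unchanged when the index is preserved, and that assigning values carrying a fresh 0..n-1 index back into a frame with a gapped index produces NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy frame with a gapped index, like a frame after dropping rows.
toy = pd.DataFrame({"Salary": rng.normal(size=6),
                    "PTS": rng.normal(size=6)},
                   index=[0, 1, 3, 4, 6, 7])


def min_max(col):
    """Rescale a Series to the 0-1 range (hypothetical helper)."""
    return (col - col.min()) / (col.max() - col.min())


before = toy["Salary"].corr(toy["PTS"])

# Scaling in place keeps toy's index, so rows stay aligned.
scaled_ok = toy.copy()
scaled_ok["PTS"] = min_max(toy["PTS"])
after_ok = scaled_ok["Salary"].corr(scaled_ok["PTS"])

# A rebuilt Series has index 0..5; pandas aligns on labels when assigning,
# so labels 6 and 7 find no match and become NaN.
fresh = pd.Series(min_max(toy["PTS"]).to_numpy(), name="PTS")
scaled_bad = toy.copy()
scaled_bad["PTS"] = fresh

print(before, after_ok)
print(scaled_bad["PTS"].isna().sum())
```

Passing `index=numeric_nba.index` when building the scaled frame, or assigning the raw array from `fit_transform`, would be two ways to avoid the misalignment.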

In [125]:
#columns to be clustered

clust_data_nba = nba_data[['AST','PTS','TOV','FG','FT','FGA' ,'FTA','2PA','2P','GS','MP', 'STL','3P','3PA','Age','TRB','PF','BLK','FT%','ORB','G','FG%','eFG%','3P%','2P%']]

scaler = MinMaxScaler()
clust_data_nba = scaler.fit_transform(clust_data_nba)

#kMeans algorithm with 2 centers 

np.random.seed(1)
kmeans_obj_nba = KMeans(n_clusters=2, random_state=1).fit(clust_data_nba)

print(kmeans_obj_nba.labels_) #cluster labels for each point in nba_data 




KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.



[0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 1 1 1
 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1 0 1 0
 1 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1
 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1
 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 1 0 0 0
 1 0 1 1 0 1 0 0 0 1 1 1 1 1 0 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 1 1 0 1 1 1
 1 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1 1 1 1 0 0 1 0 0 0 0 1 1 1 1 1 0 0 0 1 1 0
 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 1 0
 1 0 0 1 1 1 0 1 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 1 1 1 0
 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0
 0 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 1]


In [126]:
print(kmeans_obj_nba.cluster_centers_) #coordinates of the cluster centers in the feature space
print(kmeans_obj_nba.labels_) # two clusters: 0 and 1 
print(kmeans_obj_nba.inertia_) #how compact the clusters are; want to be low 

[[0.0858744  0.12485079 0.10988814 0.13070542 0.0593736  0.1430165
  0.05834411 0.10803861 0.09272543 0.10066066 0.25688363 0.18412698
  0.10924392 0.12289431 0.36938272 0.12230733 0.2472091  0.06194194
  0.73408444 0.09814536 0.52197531 0.47345898 0.57067277 0.31462222
  0.64970013]
 [0.3055374  0.44697495 0.37925282 0.46723158 0.25631377 0.47452014
  0.26299777 0.41679399 0.36923942 0.65724106 0.6815978  0.4568309
  0.30174094 0.33079092 0.4165061  0.35734842 0.55354786 0.1660678
  0.78060116 0.27171976 0.82000642 0.54125151 0.63193171 0.34504624
  0.69678426]]
[0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 1 1 1
 0 0 0 1 0 1 1 0 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1 0 1 0
 1 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1
 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 1
 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 1 0 0 0
 1 0 1 1 0 1 0 0 0 1 1 1 1 1 0 1 1 0 0 0 1 1 0 1 1 0 1 

In [127]:
#understanding cluster centers and cluster labels 

clust_columns = ['AST','PTS','TOV','FG','FT','FGA' ,'FTA','2PA','2P','GS','MP', 'STL','3P','3PA','Age','TRB','PF','BLK','FT%','ORB','G','FG%','eFG%','3P%','2P%']
cluster_labels = kmeans_obj_nba.labels_

#add cluster labels to dataset
nba_data['Performance Cluster'] = cluster_labels
#nba_data.head(5)

#df with cluster centers and columns 
cluster_centers_df = pd.DataFrame(kmeans_obj_nba.cluster_centers_, columns=clust_columns)

print(cluster_centers_df)

#cluster 1 players are highest performing out of 0 and 1 

        AST       PTS       TOV        FG        FT       FGA       FTA  \
0  0.085874  0.124851  0.109888  0.130705  0.059374  0.143016  0.058344   
1  0.305537  0.446975  0.379253  0.467232  0.256314  0.474520  0.262998   

        2PA        2P        GS  ...       TRB        PF       BLK       FT%  \
0  0.108039  0.092725  0.100661  ...  0.122307  0.247209  0.061942  0.734084   
1  0.416794  0.369239  0.657241  ...  0.357348  0.553548  0.166068  0.780601   

        ORB         G       FG%      eFG%       3P%       2P%  
0  0.098145  0.521975  0.473459  0.570673  0.314622  0.649700  
1  0.271720  0.820006  0.541252  0.631932  0.345046  0.696784  

[2 rows x 25 columns]


In [128]:
#train the model with 3 clusters to visualize variables 

kmeans_obj_nba = KMeans(n_clusters=3, random_state=1).fit(clust_data_nba)

#add cluster labels to dataset
nba_data['Performance Cluster'] = kmeans_obj_nba.labels_

#3d scatter plot
fig = px.scatter_3d(nba_data, x='AST', y='PTS', z='FG', color='Salary',
                    title='3D Scatter Plot with Cluster Centers',
                    symbol='Performance Cluster', size_max=10, opacity=0.7)

#cluster centers, converted from the scaled 0-1 space back to the original units
#columns 0, 1, and 3 of the clustered features are AST, PTS, and FG
centers = scaler.inverse_transform(kmeans_obj_nba.cluster_centers_)
fig.add_trace(go.Scatter3d(x=centers[:, 0],
                           y=centers[:, 1],
                           z=centers[:, 3],
                           mode='markers',
                           marker=dict(size=10, color='black', symbol='x'),
                           name='Cluster centers'))

fig.update_layout(legend=dict(x=0.8, y=0.9))

fig.show(renderer="browser")







In [129]:
#df with cluster centers and columns 
cluster_centers_df = pd.DataFrame(kmeans_obj_nba.cluster_centers_, columns=clust_columns)

print(cluster_centers_df)

#cluster 2 players are the highest performing of clusters 0, 1, and 2 

        AST       PTS       TOV        FG        FT       FGA       FTA  \
0  0.176199  0.267429  0.222397  0.281713  0.127958  0.296520  0.131603   
1  0.053195  0.075359  0.073258  0.079806  0.037574  0.086450  0.036265   
2  0.417177  0.591570  0.506339  0.612440  0.370039  0.618566  0.375876   

        2PA        2P        GS  ...       TRB        PF       BLK       FT%  \
0  0.233504  0.205158  0.346873  ...  0.256340  0.463667  0.129318  0.772034   
1  0.068366  0.060598  0.055767  ...  0.077465  0.155761  0.040033  0.706387   
2  0.565962  0.496717  0.836170  ...  0.413344  0.594148  0.177177  0.800000   

        ORB         G       FG%      eFG%       3P%       2P%  
0  0.209839  0.774286  0.511830  0.617702  0.340326  0.677512  
1  0.063698  0.389476  0.468316  0.555492  0.299972  0.649834  
2  0.287942  0.845679  0.544367  0.626517  0.349753  0.689942  

[3 rows x 25 columns]


In [130]:
#calculate total variance explained for quality of clustering 

#higher tve indicates better quality of clustering 

#calculate total sum of squares
#squared differences between each data point and the mean of the entire dataset 
tss = np.sum((clust_data_nba - clust_data_nba.mean(axis=0))**2)

#calculate within-cluster sum of squares
#squared differences between each data point and the centroid of its assigned cluster
wss = np.sum((clust_data_nba - kmeans_obj_nba.cluster_centers_[kmeans_obj_nba.labels_])**2)

#calculate tve
tve = (tss - wss) / tss

print(tve)

0.5403469163784919
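
The within-cluster sum of squares computed by hand above should match the `inertia_` attribute that scikit-learn reports. The sketch below checks this on synthetic data; the two-blob matrix `X` is made up and stands in for the scaled feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic stand-in for the scaled features: two tight blobs in 2-D.
X = np.vstack([rng.normal(0.2, 0.05, size=(50, 2)),
               rng.normal(0.8, 0.05, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Total sum of squares: squared distance of every point to the overall mean.
tss = np.sum((X - X.mean(axis=0)) ** 2)

# Within-cluster sum of squares: squared distance to the assigned center.
wss = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)

# Share of the total variance explained by the cluster assignment.
tve = (tss - wss) / tss

print(np.isclose(wss, km.inertia_))
print(tve)
```

Since `inertia_` is the same quantity as `wss`, the ratio could also be written as `1 - km.inertia_ / tss`.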


In [131]:
#calculate silhouette scores for quality of clustering 

#high score indicates data point is well matched to its cluster and poorly matched to neighboring clusters
#score 0 indicates overlapping clusters
#negative score indicates data points assigned to the wrong cluster

silhouette_avg = silhouette_score(clust_data_nba, kmeans_obj_nba.labels_)

print(silhouette_avg)

0.2636769074613832


In [132]:
#elbow method for best k 
#a loop is needed here: one model is fit per candidate k

wcss = []
for i in range(1, 11):
    kmeans_elbow = KMeans(n_clusters=i, random_state=1).fit(clust_data_nba)
    wcss.append(kmeans_elbow.inertia_)


elbow_data_nba = pd.DataFrame({"k": range(1, 11), "wcss": wcss})
fig = px.line(elbow_data_nba, x="k", y="wcss", title="Elbow Method")

fig.show(renderer="browser")





In [133]:
#silhouette scores for best k 

silhouette_scores = []
for k in range(2, 11):
    kmeans_obj = KMeans(n_clusters=k, random_state=1).fit(clust_data_nba)
    silhouette_scores.append(silhouette_score(clust_data_nba, kmeans_obj.labels_))

#k with the highest average silhouette score (+2 because the range starts at k=2)
best_nc = silhouette_scores.index(max(silhouette_scores))+2


#plot the silhouette scores
fig = go.Figure(data=go.Scatter(x=list(range(2, 11)), y=silhouette_scores))

fig.show(renderer="browser")


#best k is 2 





In [134]:
#retrain the model with 2 clusters

kmeans_obj_nba = KMeans(n_clusters=2, random_state=1).fit(clust_data_nba)

#add cluster labels to dataset
nba_data['Performance Cluster'] = kmeans_obj_nba.labels_

#2d scatter plot
fig = px.scatter(nba_data, x='AST', y='PTS', color='Salary',
                 title='2D Scatter Plot with Cluster Centers',
                 symbol='Performance Cluster', size_max=10, opacity=0.7)

#cluster centers, converted from the scaled 0-1 space back to the original units
centers = scaler.inverse_transform(kmeans_obj_nba.cluster_centers_)
fig.add_trace(go.Scatter(x=centers[:, 0],  #AST
                         y=centers[:, 1],  #PTS
                         mode='markers',
                         marker=dict(size=14, color='black', symbol='x'),
                         name='Cluster centers'))

fig.update_layout(legend=dict(x=0.8, y=0.9))

fig.show(renderer="browser")







In [135]:
#df with cluster centers and columns 
cluster_centers_df = pd.DataFrame(kmeans_obj_nba.cluster_centers_, columns=clust_columns)

print(cluster_centers_df)

#cluster 1 players are high performing out of 0 and 1 

        AST       PTS       TOV        FG        FT       FGA       FTA  \
0  0.085874  0.124851  0.109888  0.130705  0.059374  0.143016  0.058344   
1  0.305537  0.446975  0.379253  0.467232  0.256314  0.474520  0.262998   

        2PA        2P        GS  ...       TRB        PF       BLK       FT%  \
0  0.108039  0.092725  0.100661  ...  0.122307  0.247209  0.061942  0.734084   
1  0.416794  0.369239  0.657241  ...  0.357348  0.553548  0.166068  0.780601   

        ORB         G       FG%      eFG%       3P%       2P%  
0  0.098145  0.521975  0.473459  0.570673  0.314622  0.649700  
1  0.271720  0.820006  0.541252  0.631932  0.345046  0.696784  

[2 rows x 25 columns]


In [136]:
#calculate total variance explained for quality of clustering 
#higher tve indicates better quality of clustering 

tss = np.sum((clust_data_nba - clust_data_nba.mean(axis=0))**2)

wss = np.sum((clust_data_nba - kmeans_obj_nba.cluster_centers_[kmeans_obj_nba.labels_])**2)

tve = (tss - wss) / tss

print(tve)

#less variance explained than 3 clusters 

0.4168522070428186


In [137]:
#calculate silhouette scores for quality of clustering 

#high score indicates data point is well matched to its cluster and poorly matched to neighboring clusters
#score 0 indicates overlapping clusters
#negative score indicates data points assigned to the wrong cluster

silhouette_avg = silhouette_score(clust_data_nba, kmeans_obj_nba.labels_)

print(silhouette_avg)

#higher silhouette than 3 clusters 

0.34477019986900226


In [154]:
#good choices for team 

nba_data['Performance Cluster'] = kmeans_obj_nba.labels_

low_salary = nba_data.sort_values(by='Salary')
low_salary.head(10)

low_salary_good_player = low_salary[low_salary['Performance Cluster'] == 1]
low_salary_good_player.head(10)

#looking for players with low salary and in the higher performing cluster, cluster 1 

Unnamed: 0,Player,Salary,Age,G,GS,MP,FG,FGA,FG%,3P,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Performance Cluster
167,Jae'Sean Tate,1445967.0,25,34,22,933,135,252,0.536,22,...,64,115,179,57,32,18,40,104,335,1
291,Naz Reid,1517981.0,21,34,13,699,158,307,0.515,32,...,39,123,162,44,21,41,38,104,403,1
159,Isaiah Roby,1517981.0,22,30,12,640,99,190,0.521,18,...,46,117,163,53,25,17,40,82,258,1
256,Luguentz Dort,1517981.0,21,35,35,1040,149,384,0.388,68,...,24,97,121,51,31,10,52,85,432,1
151,Hamidou Diallo,1663861.0,22,32,5,761,143,297,0.481,12,...,37,128,165,77,31,12,49,83,381,1
39,Bruce Brown,1663861.0,24,33,20,698,118,200,0.59,9,...,51,103,154,46,26,10,26,69,283,1
171,Jalen Brunson,1663861.0,24,30,7,748,139,262,0.531,39,...,6,99,105,101,16,0,38,45,381,1
285,Monte Morris,1663861.0,25,35,9,949,142,298,0.477,40,...,7,69,76,118,24,10,24,33,370,1
227,Kendrick Nunn,1663861.0,25,28,16,839,152,330,0.461,58,...,14,80,94,76,33,9,50,54,398,1
141,Gary Trent Jr.,1663861.0,22,33,20,1030,183,433,0.423,101,...,19,52,71,49,25,5,24,53,502,1


In [155]:
#bad choices for team 

nba_data['Performance Cluster'] = kmeans_obj_nba.labels_

high_salary = nba_data.sort_values(by='Salary', ascending=False)
#high_salary.head(5)

high_salary_bad_player = high_salary[high_salary['Performance Cluster'] == 0]
high_salary_bad_player.head(10)

Unnamed: 0,Player,Salary,Age,G,GS,MP,FG,FGA,FG%,3P,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Performance Cluster
178,James Harden,40824000.0,31,8,8,290,60,135,0.444,25,...,5,36,41,83,7,6,34,14,198,0
30,Blake Griffin,36595996.0,31,20,20,626,81,222,0.365,39,...,7,97,104,77,14,2,32,42,245,0
235,Kevin Love,31300000.0,32,2,2,46,6,18,0.333,3,...,2,10,12,5,1,0,3,1,19,0
59,CJ McCollum,29354152.0,29,13,13,440,123,260,0.473,63,...,7,44,51,65,17,4,13,28,347,0
303,Otto Porter,28489239.0,27,16,6,372,64,144,0.444,28,...,20,80,100,32,9,3,15,25,186,0
246,LaMarcus Aldridge,24000000.0,35,21,18,544,115,248,0.464,27,...,17,77,94,36,8,18,20,36,288,0
393,Victor Oladipo,21000000.0,28,9,9,300,64,152,0.421,25,...,1,50,51,38,15,2,18,23,180,0
394,Victor Oladipo,21000000.0,28,15,15,486,110,284,0.387,35,...,7,67,74,71,21,9,35,33,299,0
76,Danilo Gallinari,19500000.0,32,24,2,525,81,207,0.391,47,...,5,76,81,37,13,3,27,46,280,0
140,Gary Harris,19160714.0,26,19,19,581,69,156,0.442,24,...,13,34,47,32,17,4,14,37,184,0
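
Victor Oladipo appears twice in the table above, which suggests that some players have one row per team stint. One possible way to handle this, sketched below on made-up numbers, is to sum the counting stats per player and recompute the percentages from the totals. The column subset and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows: player "X" has two stints, player "Y" has one.
stints = pd.DataFrame({
    "Player": ["X", "X", "Y"],
    "Salary": [21_000_000.0, 21_000_000.0, 5_000_000.0],
    "G": [9, 15, 30],
    "FG": [64, 110, 150],
    "FGA": [152, 284, 300],
    "PTS": [180, 299, 400],
})

counting_cols = ["G", "FG", "FGA", "PTS"]

# Salary is the same on every stint row, so keep the first value;
# counting stats are added up across stints.
season = stints.groupby("Player", as_index=False).agg(
    {"Salary": "first", **{col: "sum" for col in counting_cols}}
)

# Percentages must be recomputed from totals, not averaged across rows.
season["FG%"] = (season["FG"] / season["FGA"]).round(3)

print(season)
```

If the source file already contains a combined season row for traded players, keeping only that row would be a simpler alternative.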


In [149]:
#Use the model to select players for Mr. Rooney to consider

#Four good choices are Jae'Sean Tate, Naz Reid, Isaiah Roby, and Luguentz Dort. 
#They are in the higher-performing cluster (cluster 1) and are the lowest paid players in it. 
#Hamidou Diallo and Bruce Brown could also work, as they are the next lowest paid in cluster 1. 

#Four bad choices are James Harden, Blake Griffin, Kevin Love, and CJ McCollum. 
#They are in the lower-performing cluster (cluster 0) and are the highest paid players in it. 
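
The three categories in the assignment need explicit cutoffs. One possible way to define them, sketched below, is to compare each player's salary to the median salary of their performance cluster. The toy data, the `salary_ratio` column, and the cutoff values (0.5, 1.0, 1.5) are all assumptions chosen for illustration, not values taken from the analysis above.

```python
import numpy as np
import pandas as pd

# Made-up players; salaries in millions. Cluster 1 is the higher-performing one.
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Salary": [1.5, 8.0, 10.0, 30.0, 1.0, 25.0],
    "Performance Cluster": [1, 1, 1, 1, 0, 0],
})

# Median salary of each cluster, broadcast back to every row.
cluster_median = (players.groupby("Performance Cluster")["Salary"]
                  .transform("median"))
players["salary_ratio"] = players["Salary"] / cluster_median

high = players["Performance Cluster"] == 1
ratio = players["salary_ratio"]

# Conditions are checked in order; the first match wins.
players["category"] = np.select(
    [high & (ratio <= 0.5),    # strong performer, at most half the median
     high & (ratio <= 1.0),    # strong performer, at or below the median
     ~high & (ratio >= 1.5)],  # weaker performer, well above the median
    ["good choice", "could work", "not a good choice"],
    default="other",
)

print(players[["Player", "salary_ratio", "category"]])
```

Using a ratio to the cluster median keeps the cutoffs easy to explain to a non-technical reader: a ratio of 0.5 means the player earns half of what a typical player in the same performance group earns.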