## Day 49 Lecture 1 Assignment

In this assignment, we will apply GMM (Gaussian Mixture Modeling) clustering to a dataset containing player-season statistics for NBA players from the past four years.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from scipy.special import entr

This dataset contains player-season statistics for NBA players from the past four years. Each row in this dataset represents a player's per-game averages for a single season. 

This dataset contains the following variables:

- Seas: season ('2019' = 2018-2019 season, '2018' = 2017-2018 season, etc.)
- Player: player name
- Pos: position
- Age: age
- Tm: team
- G: games played
- GS: games started
- MP: minutes played
- FG: field goals
- FGA: field goals attempted
- FG%: field goal percentage
- 3P: 3 pointers
- 3PA: 3 pointers attempted
- 3P%: 3 point percentage
- 2P: 2 pointers
- 2PA: 2 pointers attempted
- 2P%: 2 point percentage
- eFG%: effective field goal percentage
- FT: free throws
- FTA: free throws attempted
- FT%: free throw percentage
- ORB: offensive rebounds
- DRB: defensive rebounds
- TRB: total rebounds
- AST: assists
- STL: steals
- BLK: blocks
- TOV: turnovers
- PF: personal fouls
- PTS: points

Load the dataset.

In [2]:
# answer goes here
df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/nba_player_seasons.csv')

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2141 entries, 0 to 2140
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Seas    2141 non-null   int64  
 1   Player  2141 non-null   object 
 2   Pos     2141 non-null   object 
 3   Age     2141 non-null   int64  
 4   Tm      2141 non-null   object 
 5   G       2141 non-null   int64  
 6   GS      2141 non-null   int64  
 7   MP      2141 non-null   float64
 8   FG      2141 non-null   float64
 9   FGA     2141 non-null   float64
 10  FG%     2131 non-null   float64
 11  3P      2141 non-null   float64
 12  3PA     2141 non-null   float64
 13  3P%     1967 non-null   float64
 14  2P      2141 non-null   float64
 15  2PA     2141 non-null   float64
 16  2P%     2110 non-null   float64
 17  eFG%    2131 non-null   float64
 18  FT      2141 non-null   float64
 19  FTA     2141 non-null   float64
 20  FT%     2037 non-null   float64
 21  ORB     2141 non-null   float64
 22  

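| | Seas | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | eFG% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | Álex Abrines | SG | 25 | OKC | 31 | 2 | 19.0 | 1.8 | 5.1 | 0.357 | 1.3 | 4.1 | 0.323 | 0.5 | 1.0 | 0.5 | 0.487 | 0.4 | 0.4 | 0.923 | 0.2 | 1.4 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1.7 | 5.3 |
| 1 | 2019 | Quincy Acy | PF | 28 | PHO | 10 | 0 | 12.3 | 0.4 | 1.8 | 0.222 | 0.2 | 1.5 | 0.133 | 0.2 | 0.3 | 0.667 | 0.278 | 0.7 | 1.0 | 0.7 | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 |
| 2 | 2019 | Jaylen Adams | PG | 22 | ATL | 34 | 1 | 12.6 | 1.1 | 3.2 | 0.345 | 0.7 | 2.2 | 0.338 | 0.4 | 1.1 | 0.361 | 0.459 | 0.2 | 0.3 | 0.778 | 0.3 | 1.4 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 1.3 | 3.2 |
| 3 | 2019 | Steven Adams | C | 25 | OKC | 80 | 80 | 33.4 | 6.0 | 10.1 | 0.595 | 0.0 | 0.0 | 0.0 | 6.0 | 10.1 | 0.596 | 0.595 | 1.8 | 3.7 | 0.5 | 4.9 | 4.6 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 2.6 | 13.9 |
| 4 | 2019 | Bam Adebayo | C | 21 | MIA | 82 | 28 | 23.3 | 3.4 | 5.9 | 0.576 | 0.0 | 0.2 | 0.2 | 3.4 | 5.7 | 0.588 | 0.579 | 2.0 | 2.8 | 0.735 | 2.0 | 5.3 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 2.5 | 8.9 |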


The goal is to cluster these player-seasons to identify potential player "archetypes".  
The pre-processing steps will be identical to what we previously did for K-means.

Begin by removing player-seasons that fail either of the following criteria, so that only rows meeting both remain:
1. Started at least 20 games
2. Averaged at least 10 minutes per game

In [3]:
# answer goes here
nba = df.loc[(df['GS'] >= 20) & (df['MP'] >= 10)]

nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 804 entries, 3 to 2139
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Seas    804 non-null    int64  
 1   Player  804 non-null    object 
 2   Pos     804 non-null    object 
 3   Age     804 non-null    int64  
 4   Tm      804 non-null    object 
 5   G       804 non-null    int64  
 6   GS      804 non-null    int64  
 7   MP      804 non-null    float64
 8   FG      804 non-null    float64
 9   FGA     804 non-null    float64
 10  FG%     804 non-null    float64
 11  3P      804 non-null    float64
 12  3PA     804 non-null    float64
 13  3P%     771 non-null    float64
 14  2P      804 non-null    float64
 15  2PA     804 non-null    float64
 16  2P%     804 non-null    float64
 17  eFG%    804 non-null    float64
 18  FT      804 non-null    float64
 19  FTA     804 non-null    float64
 20  FT%     804 non-null    float64
 21  ORB     804 non-null    float64
 22  D

Choose a subset of numeric columns that is interesting to you from an "archetypal" standpoint. 

We will choose the following basic statistics: **points, total rebounds, assists, steals, blocks**, and **turnovers**, but you should feel free to choose other reasonable feature sets if you like. Be careful not to include too many dimensions (curse of dimensionality).

In [4]:
# answer goes here
X = nba.loc[:,['PTS','TRB','AST','STL','BLK','TOV']]

X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 804 entries, 3 to 2139
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PTS     804 non-null    float64
 1   TRB     804 non-null    float64
 2   AST     804 non-null    float64
 3   STL     804 non-null    float64
 4   BLK     804 non-null    float64
 5   TOV     804 non-null    float64
dtypes: float64(6)
memory usage: 44.0 KB


Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [5]:
# answer goes here
scale = StandardScaler()
X_scale = pd.DataFrame(scale.fit_transform(X), columns=X.columns)
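
As a quick sanity check on what standardization does, the sketch below applies `StandardScaler` to a small, hypothetical table (the values and the names `toy`, `toy_scaler`, and `toy_scaled` are illustrative, not taken from the dataset). It confirms that each column ends up with mean 0 and unit variance, and that `inverse_transform` recovers the original values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for two of the selected features.
toy = pd.DataFrame({
    "PTS": [5.3, 13.9, 8.9, 24.6, 17.8],
    "TRB": [1.5, 9.5, 7.3, 8.4, 10.5],
})

toy_scaler = StandardScaler()
# Passing index=toy.index keeps the original row labels on the scaled frame.
toy_scaled = pd.DataFrame(
    toy_scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# StandardScaler divides by the population standard deviation (ddof=0),
# so the check uses ddof=0 as well.
col_means = toy_scaled.mean()
col_stds = toy_scaled.std(ddof=0)
print(col_means.round(6))
print(col_stds.round(6))

# inverse_transform maps standardized values back to the original units.
toy_restored = toy_scaler.inverse_transform(toy_scaled)
print(np.allclose(toy_restored, toy.to_numpy()))
```

Keeping the original index on the scaled frame is optional, but it makes it easier to join cluster labels back to the source rows later.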

Run both K-Means and Gaussian mixture modeling twice, once with 3 clusters and once with 7 clusters. Print out the resulting means for all 4 scenarios (KM+3, GMM+3, KM+7, GMM+7). When printing the means, transform the scaled versions back into their corresponding unscaled values.

What "archetypes" do you see? Are the archetypes identified by GMM similar to those identified by K-Means? How do the means of GMM differ from those of K-Means?

In [6]:
# K-means with 3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_scale)

# Centroids are in standardized units; map them back to per-game values
centers = pd.DataFrame(kmeans.cluster_centers_, columns=X.columns)
centers_inverse = pd.DataFrame(scale.inverse_transform(centers), columns=X.columns)

kmeans_3_cluster = centers_inverse.style.background_gradient()
kmeans_3_cluster

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 10.505405 | 3.949775 | 2.063739 | 0.80473 | 0.374324 | 1.235135 |
| 1 | 13.735359 | 8.605525 | 1.914917 | 0.777348 | 1.230387 | 1.668508 |
| 2 | 19.49162 | 5.293855 | 5.934078 | 1.35419 | 0.486034 | 2.797765 |


In [7]:
#gmm 3 cluster
gmm3 = GaussianMixture(n_components=3)
gmm3.fit(X_scale)

gmm_centers = pd.DataFrame(gmm3.means_, columns=X_scale.columns)
gmm_centers_inverse = pd.DataFrame(scale.inverse_transform(gmm_centers), columns=X_scale.columns)
gmm_centers_inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 10.228179 | 4.504354 | 1.579742 | 0.779853 | 0.433706 | 1.107959 |
| 1 | 14.550186 | 8.524727 | 2.279152 | 0.777242 | 1.230885 | 1.842771 |
| 2 | 15.895675 | 4.173718 | 4.807518 | 1.176113 | 0.370977 | 2.244243 |


In [8]:
# K-means with 7 clusters
kmeans = KMeans(n_clusters=7)
kmeans.fit(X_scale)

# Centroids are in standardized units; map them back to per-game values
centers = pd.DataFrame(kmeans.cluster_centers_, columns=X.columns)
centers_inverse = pd.DataFrame(scale.inverse_transform(centers), columns=X.columns)

kmeans_7_cluster = centers_inverse.style.background_gradient()
kmeans_7_cluster

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 11.285714 | 4.885714 | 2.130769 | 1.354945 | 0.498901 | 1.3 |
| 1 | 11.52649 | 7.282119 | 1.688079 | 0.655629 | 0.936424 | 1.39404 |
| 2 | 24.603226 | 8.374194 | 7.803226 | 1.603226 | 0.809677 | 3.941935 |
| 3 | 18.556452 | 4.720161 | 5.804032 | 1.33629 | 0.402419 | 2.630645 |
| 4 | 14.293452 | 3.707738 | 3.166667 | 0.842857 | 0.27619 | 1.714286 |
| 5 | 17.772881 | 10.484746 | 2.3 | 0.923729 | 1.759322 | 2.137288 |
| 6 | 7.546667 | 3.490556 | 1.372222 | 0.592222 | 0.355 | 0.888333 |


In [9]:
#gmm 7 cluster
gmm7 = GaussianMixture(n_components=7)
gmm7.fit(X_scale)

gmm_centers = pd.DataFrame(gmm7.means_, columns=X_scale.columns)
gmm_centers_inverse = pd.DataFrame(scale.inverse_transform(gmm_centers), columns=X_scale.columns)
gmm_centers_inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 16.530193 | 9.282297 | 2.92397 | 0.896829 | 1.359077 | 2.120127 |
| 1 | 17.983791 | 3.93951 | 5.635105 | 1.134762 | 0.300336 | 2.542821 |
| 2 | 7.971222 | 3.468278 | 1.338104 | 0.625561 | 0.348648 | 0.883253 |
| 3 | 13.010276 | 3.713215 | 2.881514 | 0.964244 | 0.293982 | 1.560183 |
| 4 | 21.69367 | 8.35804 | 7.767062 | 1.587919 | 0.893005 | 3.758962 |
| 5 | 11.398006 | 6.738907 | 1.50748 | 0.704458 | 0.836242 | 1.320188 |
| 6 | 18.240757 | 5.057475 | 5.130483 | 1.75195 | 0.522024 | 2.413584 |


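Overall, the archetypes are fairly similar between K-Means and GMM, and the 7-cluster solutions separate the groups more finely than the 3-cluster ones. The clearest difference is in the 3-cluster solutions: the high-scoring, high-assist cluster averages about 19.5 points under K-Means but only about 15.9 under GMM. This comes from how each model computes its means. A K-Means centroid is the unweighted mean of the points hard-assigned to that cluster (it is generally not an actual data point). A GMM mean is a weighted mean over all points, where each point is weighted by its posterior probability of belonging to that component, and each component also has its own covariance. Points that K-Means assigns entirely to a neighboring cluster can therefore still pull a GMM mean toward them, so the two sets of means diverge where clusters overlap.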
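
One way to check how each model arrives at its means is to recompute them by hand. The sketch below uses hypothetical synthetic 2-D data (two Gaussian blobs, not the NBA data) and illustrative names such as `km_manual` and `gm_manual`. It recomputes the K-Means centroids as plain averages of the hard-assigned points, and the GMM means as averages weighted by the `predict_proba` responsibilities. The recomputed values should agree with the fitted ones up to the convergence tolerance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hypothetical synthetic data: two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2)),
])

# K-Means: each centroid is the unweighted mean of its hard-assigned points.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
km_manual = np.vstack([data[km.labels_ == k].mean(axis=0) for k in range(2)])
print(np.abs(km_manual - km.cluster_centers_).max())

# A centroid is an average, so it need not coincide with any observed point.
centroid_is_data_point = any(
    np.isclose(data, c).all(axis=1).any() for c in km.cluster_centers_
)
print(centroid_is_data_point)

# GMM: each mean is a responsibility-weighted average over ALL points.
gm = GaussianMixture(n_components=2, tol=1e-6, max_iter=500, random_state=0).fit(data)
resp = gm.predict_proba(data)  # shape (n_samples, n_components)
gm_manual = (resp.T @ data) / resp.sum(axis=0)[:, None]
print(np.abs(gm_manual - gm.means_).max())
```

Because every point contributes to every GMM mean in proportion to its responsibility, points near a cluster boundary affect both means, whereas in K-Means they affect only the centroid they are assigned to.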

Predict the probability of each player belonging to each of the 3 clusters using the GMM model. Then, calculate the entropy for each set of predicted probabilities.

We will use entropy as a measure of how confident we are in the predicted class label. If we had no confidence in our prediction, we would assign 33% probability to each class, while if we were totally confident, we would assign 100% to one class. Entropy would be at a maximum in the "no confidence" scenario and a minimum in the "full confidence" scenario, which makes it a reasonable way to quantify our uncertainty in our prediction. There are certainly other methods as well; feel free to experiment with them if desired.
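
To make the two extremes concrete, the sketch below computes the entropy of three hypothetical probability rows (uniform, nearly certain, and fully certain) using `scipy.special.entr`, which returns `-p * log(p)` elementwise with the natural logarithm. The array `probs` and the normalized score are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from scipy.special import entr

# Hypothetical predicted probabilities for three players over 3 clusters.
probs = np.array([
    [1 / 3, 1 / 3, 1 / 3],   # no confidence
    [0.98, 0.01, 0.01],      # high confidence
    [1.0, 0.0, 0.0],         # full confidence
])

# Shannon entropy per row: sum of -p * log(p) across the clusters.
row_entropy = entr(probs).sum(axis=1)

# With 3 classes the maximum possible entropy is log(3), about 1.0986.
max_entropy = np.log(3)

# Optional: rescale to [0, 1] so scores are comparable across cluster counts.
normalized = row_entropy / max_entropy

print(row_entropy)
print(normalized)
```

The uniform row reaches the maximum of log(3), and the fully certain row has entropy 0, so sorting by this value ranks predictions from least to most confident.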

Which five predicted labels are we least confident about? Which five are we most confident about? Print out the associated details (season, player name, stats, etc.) from those players.

In [10]:
# answer goes here
# Note: this cell uses the full, unfiltered df (not the filtered nba subset),
# so the scaler and the 3-component GMM are refit here on all player-seasons.
X = df[['PTS','TRB','AST','STL','BLK','TOV']].copy()  # copy avoids SettingWithCopyWarning
X_scale = pd.DataFrame(scale.fit_transform(X), columns=X.columns)

# Fit once and keep the hard cluster labels
gmm3 = GaussianMixture(n_components=3)
X["cluster"] = gmm3.fit_predict(X_scale)

# entr(p) = -p * log(p) elementwise, so columns 0, 1, 2 hold the per-cluster
# entropy terms (not the probabilities); their sum is the entropy of each row
probs = pd.DataFrame(gmm3.predict_proba(X_scale))
entropy = entr(probs)
entropy['entropy'] = entropy[0] + entropy[1] + entropy[2]

X_probs = pd.concat([X, entropy], axis=1)
X_probs.head()


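| | PTS | TRB | AST | STL | BLK | TOV | cluster | 0 | 1 | 2 | entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.3 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1 | 0.013432 | 0.021482 | 0.076853 | 0.111766 |
| 1 | 1.7 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 1 | 0.0397 | 0.011703 | 0.019749 | 0.071152 |
| 2 | 3.2 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 2 | 0.076065 | 0.353825 | 0.344508 | 0.774398 |
| 3 | 13.9 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 0 | 6e-06 | 6.6e-05 | 1.1e-05 | 8.3e-05 |
| 4 | 8.9 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 0 | 0.0166 | 0.067754 | 0.00193 | 0.086283 |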


In [11]:
# five predicted labels we are least confident about (highest entropy)
X_probs.sort_values(by='entropy', ascending=False).head()

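| | PTS | TRB | AST | STL | BLK | TOV | cluster | 0 | 1 | 2 | entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1590 | 3.8 | 3.1 | 1.0 | 0.7 | 0.0 | 0.2 | 0 | 0.36494 | 0.350629 | 0.36509 | 1.080659 |
| 915 | 11.8 | 4.5 | 1.2 | 1.0 | 0.4 | 1.1 | 2 | 0.366379 | 0.34707 | 0.365156 | 1.078605 |
| 2058 | 5.8 | 3.8 | 1.7 | 1.0 | 0.4 | 1.1 | 2 | 0.324037 | 0.367581 | 0.365148 | 1.056766 |
| 1445 | 14.0 | 4.6 | 2.0 | 0.7 | 0.3 | 1.1 | 2 | 0.352862 | 0.344786 | 0.350498 | 1.048145 |
| 1816 | 4.5 | 3.4 | 1.0 | 0.9 | 0.2 | 0.8 | 1 | 0.324016 | 0.355583 | 0.365798 | 1.045397 |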


In [12]:
# five predicted labels we are most confident about (lowest entropy)
X_probs.sort_values(by='entropy', ascending=True).head()

| | PTS | TRB | AST | STL | BLK | TOV | cluster | 0 | 1 | 2 | entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1580 | 6.0 | 10.0 | 1.0 | 0.0 | 6.0 | 2.0 | 0 | -0.0 | 2.332941e-250 | 7.80459e-261 | 2.332941e-250 |
| 2117 | 14.2 | 11.8 | 0.4 | 0.6 | 3.7 | 1.9 | 0 | -0.0 | 6.561753e-67 | 3.896346e-85 | 6.561753e-67 |
| 1578 | 4.0 | 5.5 | 0.5 | 0.0 | 3.0 | 1.0 | 0 | -0.0 | 1.667081e-57 | 9.516034e-65 | 1.667081e-57 |
| 484 | 13.3 | 7.2 | 1.6 | 0.8 | 2.7 | 1.4 | 0 | -0.0 | 2.118958e-37 | 3.204021e-43 | 2.118961e-37 |
| 940 | 22.7 | 6.6 | 1.2 | 0.8 | 2.4 | 1.9 | 0 | -0.0 | 7.389275e-41 | 2.587124e-31 | 2.587124e-31 |
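
The tables above show only the statistics, while the question also asks for season and player name. A minimal sketch of one way to attach those details is to join the entropy scores back to the source rows on the shared index. The frames `players` and `scores` below are hypothetical stand-ins with placeholder names, not rows from the dataset. In the notebook, the same idea would apply to `df` and `X_probs`, assuming their indexes still line up.

```python
import pandas as pd

# Hypothetical stand-in for the source table (placeholder player names).
players = pd.DataFrame(
    {
        "Seas": [2019, 2019, 2018, 2018],
        "Player": ["Player A", "Player B", "Player C", "Player D"],
        "Pos": ["C", "PG", "SF", "SG"],
        "PTS": [13.9, 3.2, 11.8, 5.3],
    },
    index=[3, 2, 915, 0],
)

# Hypothetical stand-in for the cluster labels and entropy scores,
# indexed the same way as the source table.
scores = pd.DataFrame(
    {
        "cluster": [0, 2, 2, 1],
        "entropy": [0.000083, 0.774398, 1.078605, 0.111766],
    },
    index=players.index,
)

# join aligns rows on the index, so each score lands next to its player.
detailed = players.join(scores)

# Highest entropy = least confident; lowest entropy = most confident.
least_confident = detailed.nlargest(2, "entropy")
most_confident = detailed.nsmallest(2, "entropy")

print(least_confident[["Seas", "Player", "cluster", "entropy"]])
print(most_confident[["Seas", "Player", "cluster", "entropy"]])
```

Joining on the index only works if the scaled or scored frame keeps the same row labels as the source table, which is worth verifying before trusting the joined output.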
