## Day 49 Lecture 1 Assignment

In this assignment, we will apply GMM (Gaussian Mixture Modeling) clustering to a dataset containing player-season statistics for NBA players from the past four years.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from scipy.special import entr

This dataset contains player-season statistics for NBA players from the past four years. Each row in this dataset represents a player's per-game averages for a single season. 

This dataset contains the following variables:

- Seas: season ('2019' = 2018-2019 season, '2018' = 2017-2018 season, etc.)
- Player: player name
- Pos: position
- Age: age
- Tm: team
- G: games played
- GS: games started
- MP: minutes played
- FG: field goals
- FGA: field goals attempted
- FG%: field goal percentage
- 3P: 3 pointers
- 3PA: 3 pointers attempted
- 3P%: 3 point percentage
- 2P: 2 pointers
- 2PA: 2 pointers attempted
- 2P%: 2 point percentage
- eFG%: effective field goal percentage
- FT: free throws
- FTA: free throws attempted
- FT%: free throw percentage
- ORB: offensive rebound
- DRB: defensive rebound
- TRB: total rebounds
- AST: assists
- STL: steals
- BLK: blocks
- TOV: turnovers
- PF: personal fouls
- PTS: points

Load the dataset.

In [2]:
def get_df(url):
  df = pd.read_csv(url)
  return df

In [3]:
# answer goes here
NBA_df = get_df('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/nba_player_seasons.csv')
NBA_df.head()

Unnamed: 0,Seas,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,2019,Álex Abrines,SG,25,OKC,31,2,19.0,1.8,5.1,0.357,1.3,4.1,0.323,0.5,1.0,0.5,0.487,0.4,0.4,0.923,0.2,1.4,1.5,0.6,0.5,0.2,0.5,1.7,5.3
1,2019,Quincy Acy,PF,28,PHO,10,0,12.3,0.4,1.8,0.222,0.2,1.5,0.133,0.2,0.3,0.667,0.278,0.7,1.0,0.7,0.3,2.2,2.5,0.8,0.1,0.4,0.4,2.4,1.7
2,2019,Jaylen Adams,PG,22,ATL,34,1,12.6,1.1,3.2,0.345,0.7,2.2,0.338,0.4,1.1,0.361,0.459,0.2,0.3,0.778,0.3,1.4,1.8,1.9,0.4,0.1,0.8,1.3,3.2
3,2019,Steven Adams,C,25,OKC,80,80,33.4,6.0,10.1,0.595,0.0,0.0,0.0,6.0,10.1,0.596,0.595,1.8,3.7,0.5,4.9,4.6,9.5,1.6,1.5,1.0,1.7,2.6,13.9
4,2019,Bam Adebayo,C,21,MIA,82,28,23.3,3.4,5.9,0.576,0.0,0.2,0.2,3.4,5.7,0.588,0.579,2.0,2.8,0.735,2.0,5.3,7.3,2.2,0.9,0.8,1.5,2.5,8.9


The goal is to cluster these player-seasons to identify potential player "archetypes".  
The pre-processing steps will be identical to what we previously did for K-means.

Begin by removing player-seasons that fail either of the following criteria:
1. Started at least 20 games
2. Averaged at least 10 minutes per game

In [4]:
# answer goes here
# Each row is one player-season, so filter row by row:
# keep seasons with at least 20 games started and at least 10 minutes per game
NBA = NBA_df[(NBA_df['GS'] >= 20) & (NBA_df['MP'] >= 10)]
NBA.head()


Unnamed: 0,Seas,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
3,2019,Steven Adams,C,25,OKC,80,80,33.4,6.0,10.1,0.595,0.0,0.0,0.0,6.0,10.1,0.596,0.595,1.8,3.7,0.5,4.9,4.6,9.5,1.6,1.5,1.0,1.7,2.6,13.9
4,2019,Bam Adebayo,C,21,MIA,82,28,23.3,3.4,5.9,0.576,0.0,0.2,0.2,3.4,5.7,0.588,0.579,2.0,2.8,0.735,2.0,5.3,7.3,2.2,0.9,0.8,1.5,2.5,8.9
7,2019,LaMarcus Aldridge,C,33,SAS,81,81,33.2,8.4,16.3,0.519,0.1,0.5,0.238,8.3,15.8,0.528,0.522,4.3,5.1,0.847,3.1,6.1,9.2,2.4,0.5,1.3,1.8,2.2,21.3
10,2019,Jarrett Allen,C,20,BRK,80,80,26.2,4.2,7.1,0.59,0.1,0.6,0.133,4.1,6.5,0.629,0.595,2.5,3.5,0.709,2.4,6.0,8.4,1.4,0.5,1.5,1.3,2.3,10.9
12,2019,Al-Farouq Aminu,PF,28,POR,81,81,28.3,3.2,7.3,0.433,1.2,3.5,0.343,2.0,3.9,0.514,0.514,1.9,2.1,0.867,1.4,6.1,7.5,1.3,0.8,0.4,0.9,1.8,9.4


Choose a subset of numeric columns that is interesting to you from an "archetypal" standpoint. 

We will choose the following basic statistics: **points, total rebounds, assists, steals, blocks**, and **turnovers**, but you should feel free to choose other reasonable feature sets if you like. Be careful not to include too many dimensions (curse of dimensionality).

In [5]:
# answer goes here

cols = ['PTS', 'TRB', 'AST', 'STL', 'BLK', 'TOV']
NBA = NBA[cols]
NBA.head()



Unnamed: 0,PTS,TRB,AST,STL,BLK,TOV
3,13.9,9.5,1.6,1.5,1.0,1.7
4,8.9,7.3,2.2,0.9,0.8,1.5
7,21.3,9.2,2.4,0.5,1.3,1.8
10,10.9,8.4,1.4,0.5,1.5,1.3
12,9.4,7.5,1.3,0.8,0.4,0.9


Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [6]:
# answer goes here
scale = StandardScaler()
X_std = scale.fit_transform(NBA)
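
As a quick sanity check on what standardization does, here is a minimal sketch on a toy table. The values and the names `toy`, `toy_scaler`, `toy_std`, and `toy_back` are made up for illustration and are not part of the NBA data. It checks that each scaled column has mean 0 and standard deviation 1, and that `inverse_transform` recovers the original units, which is the step used later when printing cluster means.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy per-game stats (made-up values, not taken from the NBA dataset)
toy = pd.DataFrame({
    'PTS': [5.0, 12.0, 20.0, 27.0],
    'TRB': [2.0, 9.0, 4.0, 7.0],
})

toy_scaler = StandardScaler()
toy_std = toy_scaler.fit_transform(toy)

# Each column now has mean ~0 and (population) standard deviation ~1
print(toy_std.mean(axis=0))
print(toy_std.std(axis=0))

# inverse_transform maps scaled values back to the original units
toy_back = toy_scaler.inverse_transform(toy_std)
print(np.allclose(toy_back, toy.to_numpy()))
```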



Run both K-Means and Gaussian mixture modeling twice, once with 3 clusters and once with 7 clusters. Print out the resulting means for all 4 scenarios (KM+3, GMM+3, KM+7, GMM+7). When printing the means, transform the scaled versions back into their corresponding unscaled values. 

What "archetypes" do you see? Are the archetypes identified by GMM similar to those identified by K-Means? How do the means of GMM differ from those of K-Means?
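
To see why the two sets of means can differ, it may help to compare the methods on data where the answer is known. The sketch below uses two synthetic, overlapping groups (all values and the names `toy`, `km`, `gmm`, `km_means`, and `gmm_means` are assumptions for illustration). K-means assigns every point to exactly one cluster and averages the members, while a Gaussian mixture weights every point by its soft membership and also estimates a covariance and a mixing weight per component.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic groups of different size and spread (made-up "PTS"/"TRB" values)
low = rng.normal(loc=[8.0, 3.0], scale=[2.0, 1.0], size=(150, 2))
high = rng.normal(loc=[20.0, 8.0], scale=[4.0, 2.0], size=(50, 2))
toy = pd.DataFrame(np.vstack([low, high]), columns=['PTS', 'TRB'])

toy_scaler = StandardScaler()
toy_std = toy_scaler.fit_transform(toy)

# Fit both models on the scaled data
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_std)
gmm = GaussianMixture(n_components=2, random_state=0).fit(toy_std)

# Convert the centers/means back to the original units before comparing
km_means = pd.DataFrame(toy_scaler.inverse_transform(km.cluster_centers_),
                        columns=toy.columns)
gmm_means = pd.DataFrame(toy_scaler.inverse_transform(gmm.means_),
                         columns=toy.columns)

print(km_means)
print(gmm_means)
# Mixing weights: the estimated share of the data in each GMM component
print(gmm.weights_)
```

Cluster numbering is arbitrary in both methods, so rows should be matched by their values rather than by their index when comparing the two tables.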

In [7]:
# answer goes here
# Define k-means with 3 clusters
KNBA = KMeans(n_clusters=3)

# Fit model
KNBA.fit(X_std)

# Convert the cluster centers back to the original (unscaled) units
KNBA_df = pd.DataFrame(scale.inverse_transform(KNBA.cluster_centers_), columns=['points', 'total rebounds', 'assists', 'steals', 'blocks', 'turnovers'])
KNBA_df.style.background_gradient()




Unnamed: 0,points,total rebounds,assists,steals,blocks,turnovers
0,13.735359,8.605525,1.914917,0.777348,1.230387,1.668508
1,19.49162,5.293855,5.934078,1.35419,0.486034,2.797765
2,10.490787,3.945393,2.06,0.803371,0.374607,1.233034


In [8]:
# answer goes here
# Define k-means with 7 clusters
KNBA = KMeans(n_clusters=7)

# Fit model
KNBA.fit(X_std)

# Convert the cluster centers back to the original (unscaled) units
KNBA_df = pd.DataFrame(scale.inverse_transform(KNBA.cluster_centers_), columns=['points', 'total rebounds', 'assists', 'steals', 'blocks', 'turnovers'])
KNBA_df.style.background_gradient()





Unnamed: 0,points,total rebounds,assists,steals,blocks,turnovers
0,11.52381,7.321769,1.67415,0.652381,0.938776,1.395238
1,18.717742,4.824194,5.78871,1.342742,0.408871,2.65
2,11.410345,4.978161,2.085057,1.364368,0.511494,1.294253
3,14.15,3.68046,3.198851,0.855172,0.274138,1.705747
4,24.696552,8.3,7.989655,1.596552,0.824138,4.024138
5,7.566304,3.498913,1.365761,0.592935,0.36413,0.8875
6,17.755,10.473333,2.343333,0.921667,1.743333,2.14


In [11]:
# answer goes here
# Define the Gaussian mixture model with 3 components
gmm_cluster = GaussianMixture(n_components=3, random_state=123)

# Fit model and get the hard cluster label for each row
clusters = gmm_cluster.fit_predict(X_std)

# Convert the component means back to the original (unscaled) units
gmm_df = pd.DataFrame(scale.inverse_transform(gmm_cluster.means_), columns=['points', 'total rebounds', 'assists', 'steals', 'blocks', 'turnovers'])
gmm_df.style.background_gradient()




Unnamed: 0,points,total rebounds,assists,steals,blocks,turnovers
0,10.049039,5.334392,1.451168,0.678767,0.585622,1.132344
1,17.008143,8.379988,4.257382,1.086338,1.17381,2.445213
2,14.594656,3.762327,3.678398,1.083023,0.317294,1.860359


In [12]:
# answer goes here
# Define the Gaussian mixture model with 7 components
gmm_cluster = GaussianMixture(n_components=7, random_state=123)

# Fit model and get the hard cluster label for each row
clusters = gmm_cluster.fit_predict(X_std)

# Convert the component means back to the original (unscaled) units
gmm_df = pd.DataFrame(scale.inverse_transform(gmm_cluster.means_), columns=['points', 'total rebounds', 'assists', 'steals', 'blocks', 'turnovers'])
gmm_df.style.background_gradient()




Unnamed: 0,points,total rebounds,assists,steals,blocks,turnovers
0,8.314251,5.624617,1.268487,0.698495,0.773196,1.067766
1,18.224436,7.778196,5.33244,1.108604,0.97994,2.762853
2,11.496538,4.595601,2.66983,1.278338,0.471027,1.511002
3,16.823074,3.79788,5.044925,1.18712,0.28791,2.30713
4,13.964476,9.472603,1.555343,0.706959,1.365479,1.642379
5,8.771707,2.841567,1.677136,0.732255,0.263761,0.952497
6,14.274129,5.012363,1.871258,0.745065,0.386179,1.418233


Interesting. I see the same archetypes as before, except there's a difference in how the "team player" aspect shows up. With the Gaussian mixture, the highest-scoring players are grouped together with the players who also contribute the most in the other categories (rebounds, assists, blocks). 

Predict the likelihood of each player belonging to one of the 3 clusters using the GMM model. Then, calculate the entropy for each set of predicted probabilities. 

We will use entropy as a measure of how confident we are in the predicted class label. If we had no confidence in our prediction, we would assign 33% probability to each class, while if we were totally confident, we would assign 100% to one class. Entropy would be at a maximum in the "no confidence" scenario and a minimum in the "full confidence" scenario, which makes it a reasonable way to quantify our uncertainty in our prediction. There are certainly other methods as well; feel free to experiment with them if desired.
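
The two extremes described above can be checked numerically. This is a small sketch using `scipy.special.entr`, which computes `-p * log(p)` elementwise (and returns 0 for `p = 0`); the probability rows in `probs` are made-up examples, not model output.

```python
import numpy as np
from scipy.special import entr

# Each row is one hypothetical set of predicted probabilities over 3 clusters
probs = np.array([
    [1/3, 1/3, 1/3],   # no confidence: uniform over the classes
    [1.0, 0.0, 0.0],   # full confidence: all mass on one class
    [0.6, 0.3, 0.1],   # somewhere in between
])

# Entropy of each row: sum of -p * log(p) across the classes
entropy = entr(probs).sum(axis=1)
print(entropy)
```

With three classes the maximum possible entropy is `log(3)`, reached by the uniform row, and the minimum is 0, reached by the one-hot row.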

Which five predicted labels are we least confident about? Which five are we most confident about? Print out the associated details (season, player name, stats, etc.) from those players.

In [31]:
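# answer goes here
# Define the Gaussian mixture model with 3 components
gmm_cluster = GaussianMixture(n_components=3, random_state=123)

# Fit model, then get each row's probability of belonging to each cluster
gmm_cluster.fit(X_std)
clusters = gmm_cluster.predict_proba(X_std)

# Attach the probabilities to the player details.
# pd.DataFrame(clusters) gets a default 0..n-1 index, so this inner join
# pairs rows by index label with NBA_df rather than with the filtered rows.
gmm_df = pd.concat([pd.DataFrame(clusters), NBA_df], join='inner', axis=1)
gmm_df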

Unnamed: 0,0,1,2,Seas,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,5.393271e-01,0.460667,6.229032e-06,2019,Álex Abrines,SG,25,OKC,31,2,19.0,1.8,5.1,0.357,1.3,4.1,0.323,0.5,1.0,0.500,0.487,0.4,0.4,0.923,0.2,1.4,1.5,0.6,0.5,0.2,0.5,1.7,5.3
1,9.563940e-01,0.043587,1.920420e-05,2019,Quincy Acy,PF,28,PHO,10,0,12.3,0.4,1.8,0.222,0.2,1.5,0.133,0.2,0.3,0.667,0.278,0.7,1.0,0.700,0.3,2.2,2.5,0.8,0.1,0.4,0.4,2.4,1.7
2,2.929699e-01,0.707030,1.043125e-13,2019,Jaylen Adams,PG,22,ATL,34,1,12.6,1.1,3.2,0.345,0.7,2.2,0.338,0.4,1.1,0.361,0.459,0.2,0.3,0.778,0.3,1.4,1.8,1.9,0.4,0.1,0.8,1.3,3.2
3,8.761998e-01,0.123800,2.603168e-17,2019,Steven Adams,C,25,OKC,80,80,33.4,6.0,10.1,0.595,0.0,0.0,0.000,6.0,10.1,0.596,0.595,1.8,3.7,0.500,4.9,4.6,9.5,1.6,1.5,1.0,1.7,2.6,13.9
4,9.965320e-01,0.003458,9.625118e-06,2019,Bam Adebayo,C,21,MIA,82,28,23.3,3.4,5.9,0.576,0.0,0.2,0.200,3.4,5.7,0.588,0.579,2.0,2.8,0.735,2.0,5.3,7.3,2.2,0.9,0.8,1.5,2.5,8.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
800,7.527314e-20,0.006723,9.932775e-01,2018,Enes Kanter,C,25,NYK,71,71,25.8,5.9,10.0,0.592,0.0,0.0,0.000,5.9,10.0,0.594,0.592,2.2,2.6,0.848,3.8,7.1,11.0,1.5,0.5,0.5,1.7,2.6,14.1
801,1.076958e-01,0.000387,8.919168e-01,2018,Luke Kennard,SG,21,DET,73,9,20.0,2.8,6.4,0.443,1.1,2.7,0.415,1.7,3.7,0.463,0.530,0.9,1.0,0.855,0.3,2.1,2.4,1.7,0.6,0.2,0.9,1.2,7.6
802,9.860148e-01,0.013984,1.099454e-06,2018,Michael Kidd-Gilchrist,SF,24,CHO,74,74,25.0,3.8,7.6,0.504,0.0,0.0,0.000,3.8,7.5,0.505,0.504,1.6,2.3,0.684,1.1,2.9,4.1,1.0,0.7,0.4,0.7,1.9,9.2
803,7.312385e-01,0.268067,6.949682e-04,2018,Sean Kilpatrick,SG,28,TOT,52,1,12.3,2.1,5.7,0.374,0.9,2.8,0.319,1.2,2.9,0.427,0.452,1.2,1.4,0.889,0.1,1.6,1.7,0.9,0.3,0.1,0.7,0.6,6.3


The bulk of the confidence lies in the first class. 

In [33]:
# Treat a probability above 66% as confident: the five rows with the highest cluster-1 probability
gmm_df.nlargest(5, 1)

Unnamed: 0,0,1,2,Seas,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
246,1.056428e-21,1.0,1.52094e-21,2019,Andre Ingram,SG,33,LAL,4,0,3.8,0.0,1.5,0.0,0.0,0.8,0.0,0.0,0.8,0.0,0.0,0.0,0.0,,0.3,0.3,0.5,0.0,0.3,0.0,0.3,0.0,0.0
259,1.0020220000000001e-17,1.0,2.6192909999999996e-19,2019,John Jenkins,SG,27,TOT,26,0,12.8,1.6,4.0,0.4,0.8,2.2,0.379,0.8,1.8,0.426,0.505,0.6,0.7,0.833,0.2,1.2,1.4,0.8,0.0,0.1,0.3,0.4,4.7
454,1.202867e-17,1.0,6.702938999999999e-36,2019,J.R. Smith,SG,33,CLE,11,4,20.2,2.5,7.2,0.342,1.1,3.5,0.308,1.4,3.6,0.375,0.418,0.7,0.9,0.8,0.0,1.6,1.6,1.9,1.0,0.3,1.0,1.7,6.7
6,4.471822e-15,1.0,6.351628e-20,2019,DeVaughn Akoon-Purcell,SG,25,DEN,7,0,3.1,0.4,1.4,0.3,0.0,0.6,0.0,0.4,0.9,0.5,0.3,0.1,0.3,0.5,0.1,0.4,0.6,0.9,0.3,0.0,0.3,0.6,1.0
405,3.892082e-15,1.0,1.5210390000000002e-17,2019,Jakob Pöltl,C,23,SAS,77,24,16.5,2.4,3.8,0.645,0.0,0.0,,2.4,3.8,0.645,0.645,0.6,1.2,0.533,2.3,3.0,5.3,1.2,0.4,0.9,0.6,1.6,5.5


In [35]:
# Treat a probability below 33% as not confident: the five rows with the lowest cluster-1 probability
gmm_df.nsmallest(5, 1)

Unnamed: 0,0,1,2,Seas,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
668,0.846636,2.8e-05,0.153336,2018,Larry Drew,PG,27,TOT,10,0,7.0,0.7,2.4,0.292,0.3,0.9,0.333,0.4,1.5,0.267,0.354,0.0,0.0,,0.3,0.0,0.3,1.0,0.0,0.0,0.3,0.6,1.7
63,0.874534,3e-05,0.125435,2019,Tony Bradley,C,21,UTA,3,0,12.0,2.7,5.3,0.5,0.0,0.0,,2.7,5.3,0.5,0.5,0.3,0.7,0.5,3.0,2.0,5.0,0.3,0.7,0.7,1.0,2.0,5.7
762,0.857479,3.3e-05,0.142489,2018,Serge Ibaka,PF,28,TOR,76,76,27.5,5.0,10.3,0.483,1.4,3.9,0.36,3.6,6.4,0.559,0.552,1.2,1.6,0.797,1.0,5.3,6.3,0.8,0.4,1.3,1.2,2.8,12.6
533,0.897747,3.9e-05,0.102214,2018,Bam Adebayo,C,20,MIA,69,19,19.8,2.5,4.9,0.512,0.0,0.1,0.0,2.5,4.8,0.523,0.512,1.9,2.6,0.721,1.7,3.8,5.5,1.5,0.5,0.6,1.0,2.0,6.9
59,0.797911,4.1e-05,0.202048,2019,Isaac Bonga,PG,19,LAL,22,0,5.5,0.2,1.5,0.152,0.0,0.4,0.0,0.2,1.1,0.2,0.152,0.4,0.7,0.6,0.4,0.7,1.1,0.7,0.4,0.2,0.3,0.4,0.9
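
The prompt asks for confidence to be ranked by entropy across all three probabilities rather than by a single column. One way this could be done is sketched below on made-up data; the `details` table, its `player_i` names, and the column names `p0`, `p1`, `p2` are all hypothetical. The sketch also gives the probability table the same index as the rows the model was fitted on, since `predict_proba` returns a plain array in the row order of its input and a filtered DataFrame no longer has a 0..n-1 index.

```python
import numpy as np
import pandas as pd
from scipy.special import entr
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n = 60

# Made-up player details with a non-default index, mimicking a filtered DataFrame
details = pd.DataFrame({
    'Player': [f'player_{i}' for i in range(n)],
    'PTS': np.concatenate([rng.normal(8, 2, 20), rng.normal(15, 2, 20), rng.normal(24, 2, 20)]),
    'TRB': np.concatenate([rng.normal(3, 1, 20), rng.normal(8, 1, 20), rng.normal(5, 1, 20)]),
}, index=np.arange(n) * 3 + 2)

features = details[['PTS', 'TRB']].to_numpy()
toy_gmm = GaussianMixture(n_components=3, random_state=0).fit(features)

# One row of cluster probabilities per input row, in the same order as `details`
p = toy_gmm.predict_proba(features)

# Reuse the details index so the join matches each row to its own probabilities
proba_df = pd.DataFrame(p, index=details.index, columns=['p0', 'p1', 'p2'])
ranked = details.join(proba_df)

# Entropy per row: high means uncertain, low means confident
ranked['entropy'] = entr(p).sum(axis=1)

least_confident = ranked.nlargest(5, 'entropy')
most_confident = ranked.nsmallest(5, 'entropy')
print(least_confident)
print(most_confident)
```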
