## Day 47 Lecture 1 Assignment

In this assignment, we will apply k-means clustering to a dataset containing player-season statistics for NBA players from the past four years.

In [7]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Each row in this dataset represents a player's per-game averages for a single season.

This dataset contains the following variables:

- Seas: season ('2019' = 2018-2019 season, '2018' = 2017-2018 season, etc.)
- Player: player name
- Pos: position
- Age: age
- Tm: team
- G: games played
- GS: games started
- MP: minutes played
- FG: field goals
- FGA: field goals attempted
- FG%: field goal percentage
- 3P: 3 pointers
- 3PA: 3 pointers attempted
- 3P%: 3 point percentage
- 2P: 2 pointers
- 2PA: 2 pointers attempted
- 2P%: 2 point percentage
- eFG%: effective field goal percentage
- FT: free throws
- FTA: free throws attempted
- FT%: free throw percentage
- ORB: offensive rebounds
- DRB: defensive rebounds
- TRB: total rebounds
- AST: assists
- STL: steals
- BLK: blocks
- TOV: turnovers
- PF: personal fouls
- PTS: points
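
Several of these columns are tied together by simple identities, which can help when sanity-checking a row. The sketch below uses a made-up stat line (not taken from the dataset) and assumes the standard definition of effective field goal percentage, eFG% = (FG + 0.5 × 3P) / FGA. Because the dataset stores rounded per-game averages, real rows would likely satisfy these identities only approximately.

```python
import math

# Hypothetical per-game stat line, invented for illustration only.
row = {"FG": 8.0, "FGA": 16.0, "3P": 2.0, "ORB": 1.5, "DRB": 4.5}

# Made field goals are either 2-pointers or 3-pointers.
two_pointers = row["FG"] - row["3P"]

# Total rebounds are offensive plus defensive rebounds.
total_rebounds = row["ORB"] + row["DRB"]

# eFG% gives a made 3-pointer 1.5 times the weight of a made 2-pointer.
efg = (row["FG"] + 0.5 * row["3P"]) / row["FGA"]

print(two_pointers, total_rebounds, round(efg, 4))
```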

Load the dataset.

In [11]:
# answer goes here
df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/nba_player_seasons.csv')

In [12]:
df.shape

(2141, 30)

The goal is to cluster these player-seasons to identify potential player "archetypes".

Begin by removing player-seasons that meet neither of the following criteria (that is, keep a row if it satisfies at least one of them):
1. Started at least 20 games
2. Averaged at least 10 minutes per game

In [62]:
# Keep player-seasons with at least 10 minutes per game or at least 20 games started
df_ = df.loc[(df.MP >= 10) | (df.GS >= 20)]
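
The difference between "either criterion" and "both criteria" is easy to get wrong, so here is a minimal sketch on a made-up table (the players and numbers are invented for illustration). It contrasts the `|` mask used in this notebook with the stricter `&` alternative.

```python
import pandas as pd

# Hypothetical player-seasons, invented for illustration only.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "GS": [60, 25, 0, 2],
    "MP": [32.1, 8.5, 14.0, 4.2],
})

# Either criterion is enough: 10+ minutes per game OR 20+ games started.
either = toy.loc[(toy["MP"] >= 10) | (toy["GS"] >= 20)]

# Stricter alternative: both criteria must hold.
both = toy.loc[(toy["MP"] >= 10) & (toy["GS"] >= 20)]

print(either["Player"].tolist())
print(both["Player"].tolist())
```

Each comparison needs its own parentheses because `|` and `&` bind more tightly than `>=` in Python.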

Choose a subset of numeric columns that is interesting to you from an "archetypal" standpoint. 

We will choose the following basic statistics: **points, total rebounds, assists, steals, blocks**, and **turnovers**, but you should feel free to choose other reasonable feature sets if you like. Be careful not to include too many dimensions: distance-based methods such as k-means become less informative as the number of features grows (the curse of dimensionality).

- PTS: points
- TRB: total rebounds
- AST: assists
- STL: steals
- BLK: blocks
- TOV: turnovers

In [63]:
# answer goes here
X = df_[['PTS', 'TRB', 'AST', 'STL', 'BLK', 'TOV']]

Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [64]:
# answer goes here
scaler = StandardScaler()
X_scale = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
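
As a quick check of what standardization does, the sketch below scales two synthetic columns (randomly generated stand-ins, not the NBA data) and confirms that each ends up with mean 0 and standard deviation 1. The names `toy` and `toy_scaler` are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two per-game statistics on very different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "PTS": rng.normal(10.0, 5.0, size=200),
    "BLK": rng.normal(0.5, 0.3, size=200),
})

toy_scaler = StandardScaler()
toy_scaled = pd.DataFrame(toy_scaler.fit_transform(toy), columns=toy.columns)

# StandardScaler uses the population standard deviation (ddof=0),
# whereas pandas' .std() defaults to the sample version (ddof=1).
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

Without this step, a high-magnitude feature such as points would dominate the Euclidean distances that k-means relies on.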

Run K-means clustering with K = 3 and print out the resulting centroids. When printing the centroids, transform the scaled centroids back into their corresponding unscaled values. What "archetypes" do you see?

In [65]:
# answer goes here
k_means = KMeans(n_clusters=3)
X_scale['cluster'] = k_means.fit_predict(X_scale)

centers = pd.DataFrame(k_means.cluster_centers_, columns=X.columns)

In [66]:
inverse = pd.DataFrame(scaler.inverse_transform(centers), columns=X.columns)
inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 12.503014 | 7.171781 | 1.785753 | 0.827123 | 0.975068 | 1.486301 |
| 1 | 6.744118 | 2.841533 | 1.48467 | 0.549376 | 0.284046 | 0.871925 |
| 2 | 17.69639 | 4.659928 | 5.181949 | 1.227076 | 0.413718 | 2.505415 |
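
The scale, cluster, and unscale workflow can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the two features, the three group means, and the `toy_*` names. Fixing `random_state` and `n_init` is optional, but without a fixed seed the cluster numbering (and sometimes the centroids) can change from run to run.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated groups in two made-up features.
rng = np.random.default_rng(0)
group_means = np.array([[5.0, 2.0], [12.0, 8.0], [20.0, 5.0]])
toy = pd.DataFrame(
    np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in group_means]),
    columns=["PTS", "TRB"],
)

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)

# random_state and n_init are fixed so reruns give the same centroids.
toy_km = KMeans(n_clusters=3, n_init=10, random_state=0)
# Labels live in their own variable so they are never fed back in as a feature.
toy_labels = toy_km.fit_predict(toy_scaled)

# inverse_transform undoes the scaling: centroid * scale_ + mean_.
centers_raw = toy_scaler.inverse_transform(toy_km.cluster_centers_)
manual = toy_km.cluster_centers_ * toy_scaler.scale_ + toy_scaler.mean_

print(pd.DataFrame(centers_raw, columns=toy.columns).round(2))
```

Keeping the labels separate from the feature matrix also avoids having to rebuild the scaled DataFrame before every new fit.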


Experiment with different values of K. Do any further interesting archetypes come out?
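
One way to try several values of K without repeating the same cell is to wrap the steps in a small helper. The function name `unscaled_centroids`, its `seed` parameter, and the synthetic feature matrix below are all hypothetical; this is a sketch of the pattern, not the notebook's required solution. It also records the inertia (within-cluster sum of squares), which is often used as a rough guide when comparing values of K.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def unscaled_centroids(features, k, seed=0):
    """Fit k-means on standardized features.

    Returns the centroids in original units and the model's inertia.
    """
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
    centers = scaler.inverse_transform(model.cluster_centers_)
    return pd.DataFrame(centers, columns=features.columns), model.inertia_


# Synthetic stand-in for the feature matrix X, invented for illustration.
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    rng.normal(size=(300, 3)) * [5.0, 2.0, 1.0] + [10.0, 4.0, 2.0],
    columns=["PTS", "TRB", "AST"],
)

results = {k: unscaled_centroids(toy, k) for k in range(3, 9)}
for k, (centers, inertia) in results.items():
    print(f"K={k}: inertia={inertia:.1f}")
```

With the real data, the call would be `unscaled_centroids(X, k)` for each K of interest.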

In [75]:
scaler = StandardScaler()
X_scale = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
k_means = KMeans(n_clusters=4)
X_scale['cluster'] = k_means.fit_predict(X_scale)
centers = pd.DataFrame(k_means.cluster_centers_, columns=X.columns)

inverse = pd.DataFrame(scaler.inverse_transform(centers), columns=X.columns)
inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 5.758363 | 2.782918 | 1.104508 | 0.444009 | 0.299526 | 0.73274 |
| 1 | 11.257252 | 3.773855 | 2.658206 | 0.92042 | 0.328817 | 1.440267 |
| 2 | 12.927273 | 8.125108 | 1.797835 | 0.750649 | 1.222511 | 1.583117 |
| 3 | 19.837952 | 5.399398 | 6.106627 | 1.374699 | 0.493976 | 2.871687 |


In [76]:
scaler = StandardScaler()
X_scale = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
k_means = KMeans(n_clusters=5)
X_scale['cluster'] = k_means.fit_predict(X_scale)
centers = pd.DataFrame(k_means.cluster_centers_, columns=X.columns)

inverse = pd.DataFrame(scaler.inverse_transform(centers), columns=X.columns)
inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 18.183099 | 10.501408 | 2.546479 | 0.956338 | 1.776056 | 2.291549 |
| 1 | 11.200457 | 3.360046 | 2.86895 | 0.914384 | 0.271233 | 1.46895 |
| 2 | 5.662169 | 2.658134 | 1.094325 | 0.438714 | 0.27377 | 0.719672 |
| 3 | 10.628428 | 6.478261 | 1.534114 | 0.730769 | 0.850167 | 1.273913 |
| 4 | 19.433129 | 5.119632 | 6.068098 | 1.370552 | 0.447853 | 2.804294 |


In [77]:
scaler = StandardScaler()
X_scale = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
k_means = KMeans(n_clusters=6)
X_scale['cluster'] = k_means.fit_predict(X_scale)
centers = pd.DataFrame(k_means.cluster_centers_, columns=X.columns)

inverse = pd.DataFrame(scaler.inverse_transform(centers), columns=X.columns)
inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 8.719547 | 2.686686 | 2.703683 | 0.766289 | 0.215581 | 1.294618 |
| 1 | 18.072368 | 10.418421 | 2.601316 | 0.968421 | 1.713158 | 2.260526 |
| 2 | 9.529482 | 6.276494 | 1.335857 | 0.641434 | 0.905976 | 1.171713 |
| 3 | 5.479798 | 2.676912 | 0.97013 | 0.417749 | 0.275902 | 0.666522 |
| 4 | 19.520979 | 5.132168 | 6.397203 | 1.382517 | 0.434266 | 2.891608 |
| 5 | 14.516129 | 4.806855 | 2.735887 | 1.072177 | 0.412097 | 1.679435 |


In [78]:
scaler = StandardScaler()
X_scale = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
k_means = KMeans(n_clusters=7)
X_scale['cluster'] = k_means.fit_predict(X_scale)
centers = pd.DataFrame(k_means.cluster_centers_, columns=X.columns)

inverse = pd.DataFrame(scaler.inverse_transform(centers), columns=X.columns)
inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 10.022011 | 2.816304 | 2.872283 | 0.82663 | 0.220652 | 1.394565 |
| 1 | 7.20906 | 4.840268 | 1.062081 | 0.523826 | 0.738255 | 0.902349 |
| 2 | 16.291566 | 9.636145 | 2.036145 | 0.806024 | 1.756627 | 1.916867 |
| 3 | 5.400831 | 2.332724 | 1.054153 | 0.426744 | 0.200166 | 0.680233 |
| 4 | 23.677778 | 8.369444 | 7.588889 | 1.605556 | 0.877778 | 3.808333 |
| 5 | 13.206818 | 6.471818 | 2.042727 | 0.981818 | 0.604091 | 1.506818 |
| 6 | 18.054777 | 4.386624 | 5.435032 | 1.275796 | 0.364331 | 2.519108 |


In [79]:
scaler = StandardScaler()
X_scale = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
k_means = KMeans(n_clusters=8)
X_scale['cluster'] = k_means.fit_predict(X_scale)
centers = pd.DataFrame(k_means.cluster_centers_, columns=X.columns)

inverse = pd.DataFrame(scaler.inverse_transform(centers), columns=X.columns)
inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 12.637561 | 7.19561 | 1.853171 | 0.726341 | 0.817561 | 1.49122 |
| 1 | 5.310909 | 2.201636 | 1.082727 | 0.428909 | 0.182727 | 0.666909 |
| 2 | 10.550836 | 2.832441 | 3.025753 | 0.732107 | 0.196656 | 1.49699 |
| 3 | 17.948408 | 4.398089 | 5.323567 | 1.255414 | 0.380255 | 2.467516 |
| 4 | 23.483333 | 7.942857 | 7.392857 | 1.62381 | 0.811905 | 3.735714 |
| 5 | 16.701667 | 10.338333 | 2.038333 | 0.873333 | 1.918333 | 2.065 |
| 6 | 6.672 | 4.330909 | 0.943636 | 0.457818 | 0.639636 | 0.816364 |
| 7 | 9.800568 | 4.138636 | 1.959091 | 1.198295 | 0.440341 | 1.147727 |


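*The standout archetype for K ≥ 5 is cluster 4, the high-scoring, high-assist playmaker. It averages about 19.5 points and 6 assists per game at K = 5 and K = 6, and sharpens to roughly 23.5 points, 8 rebounds, and 7.5 assists at K = 7 and K = 8. The same profile appears as cluster 3 at K = 4 and as cluster 2 at K = 3, with lower averages because more player-seasons are folded into it. Cluster numbers are arbitrary labels and are not comparable across runs.*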
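
Centroids alone do not show how many player-seasons sit in each cluster or how well separated the clusters are. A minimal sketch of two common checks, cluster sizes and the silhouette score, is given below on synthetic data with one large group and one small "star" group. The data, the `summary` dictionary, and the chosen values of K are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: a large ordinary group and a small high-output group.
rng = np.random.default_rng(2)
toy = pd.DataFrame(
    np.vstack([
        rng.normal([6.0, 1.0], 1.0, size=(200, 2)),
        rng.normal([23.0, 7.5], 1.0, size=(15, 2)),
    ]),
    columns=["PTS", "AST"],
)
toy_scaled = StandardScaler().fit_transform(toy)

summary = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_scaled)
    # Number of rows assigned to each cluster, ordered by cluster label.
    sizes = pd.Series(labels).value_counts().sort_index().tolist()
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    score = silhouette_score(toy_scaled, labels)
    summary[k] = (sizes, score)
    print(f"K={k}: sizes={sizes}, silhouette={score:.2f}")
```

A very small cluster with extreme averages, like the star group here, is worth inspecting by listing its members before treating it as an archetype.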