## Day 47 Lecture 1 Assignment

In this assignment, we will apply k-means clustering to a dataset containing player-season statistics for NBA players from the past four years.

In [2]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Each row in this dataset represents one player's per-game averages for a single season.

This dataset contains the following variables:

- Seas: season ('2019' = 2018-2019 season, '2018' = 2017-2018 season, etc.)
- Player: player name
- Pos: position
- Age: age
- Tm: team
- G: games played
- GS: games started
- MP: minutes played
- FG: field goals
- FGA: field goals attempted
- FG%: field goal percentage
- 3P: 3 pointers
- 3PA: 3 pointers attempted
- 3P%: 3 point percentage
- 2P: 2 pointers
- 2PA: 2 pointers attempted
- 2P%: 2 point percentage
- eFG%: effective field goal percentage
- FT: free throws
- FTA: free throws attempted
- FT%: free throw percentage
- ORB: offensive rebounds
- DRB: defensive rebounds
- TRB: total rebounds
- AST: assists
- STL: steals
- BLK: blocks
- TOV: turnovers
- PF: personal fouls
- PTS: points

Load the dataset.

In [3]:
# answer goes here

url = 'https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/nba_player_seasons.csv'
df = pd.read_csv(url)
df.head()

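| | Seas | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | eFG% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | Álex Abrines | SG | 25 | OKC | 31 | 2 | 19.0 | 1.8 | 5.1 | 0.357 | 1.3 | 4.1 | 0.323 | 0.5 | 1.0 | 0.5 | 0.487 | 0.4 | 0.4 | 0.923 | 0.2 | 1.4 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1.7 | 5.3 |
| 1 | 2019 | Quincy Acy | PF | 28 | PHO | 10 | 0 | 12.3 | 0.4 | 1.8 | 0.222 | 0.2 | 1.5 | 0.133 | 0.2 | 0.3 | 0.667 | 0.278 | 0.7 | 1.0 | 0.7 | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 |
| 2 | 2019 | Jaylen Adams | PG | 22 | ATL | 34 | 1 | 12.6 | 1.1 | 3.2 | 0.345 | 0.7 | 2.2 | 0.338 | 0.4 | 1.1 | 0.361 | 0.459 | 0.2 | 0.3 | 0.778 | 0.3 | 1.4 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 1.3 | 3.2 |
| 3 | 2019 | Steven Adams | C | 25 | OKC | 80 | 80 | 33.4 | 6.0 | 10.1 | 0.595 | 0.0 | 0.0 | 0.0 | 6.0 | 10.1 | 0.596 | 0.595 | 1.8 | 3.7 | 0.5 | 4.9 | 4.6 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 2.6 | 13.9 |
| 4 | 2019 | Bam Adebayo | C | 21 | MIA | 82 | 28 | 23.3 | 3.4 | 5.9 | 0.576 | 0.0 | 0.2 | 0.2 | 3.4 | 5.7 | 0.588 | 0.579 | 2.0 | 2.8 | 0.735 | 2.0 | 5.3 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 2.5 | 8.9 |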


In [5]:
df.isnull().sum().loc[lambda x: x > 0]

FG%      10
3P%     174
2P%      31
eFG%     10
FT%     104
dtype: int64

The goal is to cluster these player-seasons to identify potential player "archetypes".

Begin by keeping only the player-seasons that meet both of the following criteria (remove any that fail either one):
1. Started at least 20 games
2. Averaged at least 10 minutes per game

In [None]:
# answer goes here
# Keep only player-seasons that satisfy both criteria.
df = df[(df['GS'] >= 20) & (df['MP'] >= 10)]
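The two criteria combine with a logical AND, so a row survives only if it passes both. A minimal sketch on a made-up frame (the `toy` rows are invented for illustration and are not from the NBA dataset) shows which rows such a filter keeps:

```python
import pandas as pd

# Toy player-season rows (made-up values, not from the NBA dataset).
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "GS": [80, 5, 25, 40],
    "MP": [33.4, 12.3, 8.0, 10.0],
})

# A row is kept only if it satisfies both criteria.
mask = (toy["GS"] >= 20) & (toy["MP"] >= 10)
kept = toy.loc[mask].reset_index(drop=True)

print(kept["Player"].tolist())
```

Player B fails the games-started threshold and player C fails the minutes threshold, so only A and D remain. Note that `>=` keeps rows sitting exactly on a threshold, such as D's 10.0 minutes.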

Choose a subset of numeric columns that is interesting to you from an "archetypal" standpoint. 

We will choose the following basic statistics: **points, total rebounds, assists, steals, blocks**, and **turnovers**, but you should feel free to choose other reasonable feature sets if you like. Be careful not to include too many dimensions (curse of dimensionality).

In [None]:
# answer goes here
X = df[['PTS','TRB', 'AST', 'STL', 'BLK', 'TOV']]
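The null counts shown earlier are all in percentage columns, none of which appear in the feature set chosen here. If you pick other columns, it may be worth checking for missing values first, since scikit-learn's `KMeans` raises an error on NaN input. A minimal sketch, assuming a made-up frame (the `toy` values and the `features` list are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame (made-up values): one feature is complete, one has a gap.
toy = pd.DataFrame({
    "PTS": [5.0, 2.0, 14.0, 9.0],
    "FT%": [0.9, np.nan, 0.5, 0.7],
})

# Hypothetical feature list; count missing values per chosen column.
features = ["PTS"]
missing = toy[features].isnull().sum()
print(missing)

# Including a column with NaN makes KMeans refuse to fit.
try:
    KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy[["PTS", "FT%"]])
    nan_rejected = False
except ValueError:
    nan_rejected = True

print("KMeans rejected NaN input:", nan_rejected)
```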

Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [None]:
# answer goes here
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
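To make the effect of the scaler concrete, the sketch below compares `StandardScaler` with the manual formula (x − mean) / std on synthetic data. The array `X_toy` is randomly generated for illustration only; the point is that the scaler uses the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data on different scales (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[10.0, 5.0], scale=[6.0, 2.0], size=(50, 2))

scaled = StandardScaler().fit_transform(X_toy)

# Manual z-score with the population standard deviation (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Without this step, a feature with a large numeric range such as points would dominate the Euclidean distances that k-means relies on.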

Run K-means clustering with K = 3 and print out the resulting centroids. When printing the centroids, transform the scaled centroids back into their corresponding unscaled values. What "archetypes" do you see?

In [None]:
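# answer goes here
kmeans3 = KMeans(n_clusters=3)

# Only the fitted centroids are needed here, so fit() is enough.
kmeans3.fit(X_scaled)

# Map the centroids from standardized units back to per-game units.
centers = scaler.inverse_transform(kmeans3.cluster_centers_)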

In [None]:
centers = pd.DataFrame(centers, columns = X_scaled.columns)
centers.style.background_gradient()

| | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 10.505405 | 3.949775 | 2.063739 | 0.80473 | 0.374324 | 1.235135 |
| 1 | 13.735359 | 8.605525 | 1.914917 | 0.777348 | 1.230387 | 1.668508 |
| 2 | 19.49162 | 5.293855 | 5.934078 | 1.35419 | 0.486034 | 2.797765 |
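K-means starts from random initial centroids, so cluster numbering (and occasionally the clusters themselves) can change between runs. One possible way to make a run repeatable is to fix `random_state` and set `n_init` explicitly. The sketch below does this on synthetic data (`X_toy` and `group_means` are made up, not the NBA data) and also counts the members of each cluster, which helps when judging how much weight an archetype deserves.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three synthetic, well-separated groups in two made-up features.
rng = np.random.default_rng(0)
group_means = np.array([[8.0, 3.0], [14.0, 9.0], [20.0, 5.0]])
X_toy = pd.DataFrame(
    np.vstack([rng.normal(m, 0.5, size=(40, 2)) for m in group_means]),
    columns=["PTS", "TRB"],
)

scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)

# Fixed seed and explicit n_init make the fit repeatable.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy_scaled)

# Centroids back in original units, plus the size of each cluster.
centers = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_), columns=X_toy.columns
)
centers["n_players"] = np.bincount(km.labels_, minlength=3)

# Sort by a feature so the row order does not depend on label numbering.
centers = centers.sort_values("PTS").reset_index(drop=True)
print(centers.round(2))
```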


Experiment with different values of K. Do any further interesting archetypes come out?

In [None]:
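# answer goes here
kmeans4 = KMeans(n_clusters=4)

# Only the fitted centroids are needed here, so fit() is enough.
kmeans4.fit(X_scaled)

# Map the centroids from standardized units back to per-game units.
centers = scaler.inverse_transform(kmeans4.cluster_centers_)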

In [None]:
centers = pd.DataFrame(centers, columns = X_scaled.columns)
centers.style.background_gradient()

| | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 14.96 | 9.123704 | 2.071852 | 0.773333 | 1.361481 | 1.82 |
| 1 | 8.298893 | 4.282288 | 1.390037 | 0.630996 | 0.482657 | 0.975646 |
| 2 | 20.188889 | 5.561806 | 6.289583 | 1.404167 | 0.506944 | 2.952083 |
| 3 | 13.636614 | 4.196063 | 3.003937 | 1.034646 | 0.347638 | 1.637795 |


In [None]:
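kmeans5 = KMeans(n_clusters=5)

# Only the fitted centroids are needed here, so fit() is enough.
kmeans5.fit(X_scaled)

# Map the centroids from standardized units back to per-game units.
centers = scaler.inverse_transform(kmeans5.cluster_centers_)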

In [None]:
centers = pd.DataFrame(centers, columns = X_scaled.columns)
centers.style.background_gradient()

| | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 7.755155 | 3.478351 | 1.409794 | 0.608247 | 0.356186 | 0.906701 |
| 1 | 11.485185 | 7.238272 | 1.660494 | 0.683951 | 0.926543 | 1.383333 |
| 2 | 18.730303 | 10.577273 | 2.60303 | 0.972727 | 1.692424 | 2.315152 |
| 3 | 13.441004 | 4.03682 | 2.975732 | 1.035983 | 0.333473 | 1.616318 |
| 4 | 19.760839 | 5.234965 | 6.290909 | 1.397203 | 0.456643 | 2.881818 |


In [None]:
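kmeans2 = KMeans(n_clusters=2)

# Only the fitted centroids are needed here, so fit() is enough.
kmeans2.fit(X_scaled)

# Map the centroids from standardized units back to per-game units.
centers = scaler.inverse_transform(kmeans2.cluster_centers_)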

In [None]:
centers = pd.DataFrame(centers, columns = X_scaled.columns)
centers.style.background_gradient()

| | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 11.003378 | 5.025338 | 1.96723 | 0.776858 | 0.573818 | 1.303209 |
| 1 | 19.459906 | 6.056132 | 5.474057 | 1.323113 | 0.642453 | 2.734434 |
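Rather than refitting one K at a time, the experiment can be run as a loop that records a summary per K. The sketch below is one possible approach and is not part of the assignment: it uses synthetic blobs from `make_blobs` in place of the NBA features, and reports inertia (within-cluster sum of squares) and the silhouette score as two common, imperfect guides for comparing values of K.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with six features (not the NBA dataset).
X_toy, _ = make_blobs(
    n_samples=300, centers=4, n_features=6, cluster_std=1.0, random_state=0
)
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Fit one model per K and record inertia and silhouette score.
results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy_scaled)
    results[k] = (km.inertia_, silhouette_score(X_toy_scaled, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"K={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
```

Inertia generally keeps falling as K grows, so it is usually read by looking for an "elbow"; the silhouette score can go up or down with K. Neither replaces inspecting the centroids for archetypes that make basketball sense.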


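As K increases, neighboring centroids sit closer together in points per game: the largest gap between adjacent centroids falls from about 8.5 points at K = 2 to about 5.3 at K = 5. Total rebounds behaves differently: the two centroids at K = 2 differ by only about 1 rebound per game, whereas the K = 5 centroids range from about 3.5 to 10.6.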
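The comparison across K can be checked directly from the centroid tables. The sketch below copies the PTS and TRB centroid values from the K = 2 and K = 5 outputs above and computes, for each, the overall range and the largest gap between neighboring centroids. It is a rough, unweighted illustration that ignores how many players fall in each cluster.

```python
import numpy as np

# PTS and TRB centroid values copied from the K = 2 and K = 5 tables.
centroids = {
    2: {
        "PTS": [11.003378, 19.459906],
        "TRB": [5.025338, 6.056132],
    },
    5: {
        "PTS": [7.755155, 11.485185, 18.730303, 13.441004, 19.760839],
        "TRB": [3.478351, 7.238272, 10.577273, 4.03682, 5.234965],
    },
}

spread = {}
for k, feats in centroids.items():
    for name, values in feats.items():
        v = np.sort(values)
        spread[(k, name)] = {
            "range": v[-1] - v[0],          # highest minus lowest centroid
            "max_gap": np.diff(v).max(),    # largest gap between neighbors
        }
        s = spread[(k, name)]
        print(f"K={k} {name}: range={s['range']:.2f}, max gap={s['max_gap']:.2f}")
```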