## Day 47 Lecture 1 Assignment

In this assignment, we will apply k-means clustering to a dataset containing player-season statistics for NBA players from the past four years.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
plt.style.use(['dark_background'])


Each row in this dataset represents one player's per-game averages for a single season.

This dataset contains the following variables:

- Seas: season ('2019' = 2018-2019 season, '2018' = 2017-2018 season, etc.)
- Player: player name
- Pos: position
- Age: age
- Tm: team
- G: games played
- GS: games started
- MP: minutes played
- FG: field goals
- FGA: field goals attempted
- FG%: field goal percentage
- 3P: 3 pointers
- 3PA: 3 pointers attempted
- 3P%: 3 point percentage
- 2P: 2 pointers
- 2PA: 2 pointers attempted
- 2P%: 2 point percentage
- eFG%: effective field goal percentage
- FT: free throws
- FTA: free throws attempted
- FT%: free throw percentage
- ORB: offensive rebounds
- DRB: defensive rebounds
- TRB: total rebounds
- AST: assists
- STL: steals
- BLK: blocks
- TOV: turnovers
- PF: personal fouls
- PTS: points

Load the dataset.

In [2]:
url = "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/nba_player_seasons.csv"
df = pd.read_csv(url)
df.head()



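| | Seas | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | Álex Abrines | SG | 25 | OKC | 31 | 2 | 19.0 | 1.8 | 5.1 | ... | 0.923 | 0.2 | 1.4 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1.7 | 5.3 |
| 1 | 2019 | Quincy Acy | PF | 28 | PHO | 10 | 0 | 12.3 | 0.4 | 1.8 | ... | 0.7 | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 |
| 2 | 2019 | Jaylen Adams | PG | 22 | ATL | 34 | 1 | 12.6 | 1.1 | 3.2 | ... | 0.778 | 0.3 | 1.4 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 1.3 | 3.2 |
| 3 | 2019 | Steven Adams | C | 25 | OKC | 80 | 80 | 33.4 | 6.0 | 10.1 | ... | 0.5 | 4.9 | 4.6 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 2.6 | 13.9 |
| 4 | 2019 | Bam Adebayo | C | 21 | MIA | 82 | 28 | 23.3 | 3.4 | 5.9 | ... | 0.735 | 2.0 | 5.3 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 2.5 | 8.9 |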


The goal is to cluster these player-seasons to identify potential player "archetypes".

Begin by removing player-seasons that fail either of the following criteria (that is, keep only those that satisfy both):
1. Started at least 20 games
2. Averaged at least 10 minutes per game

In [4]:
df.columns

Index(['Seas', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA',
       'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA',
       'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

In [8]:
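# keep player-seasons with 20+ games played (G) and 10+ minutes per game (MP);
# .copy() avoids a SettingWithCopyWarning when a column is added below
# (to filter on games *started*, as the prompt words it, use df["GS"] instead of df["G"])
df = df[(df["G"] >= 20) & (df["MP"] >= 10)].copy()

# create new column for points per minute
df["PTS/MIN"] = df["PTS"] / df["MP"]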
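
The filter above can be checked on a small example. The sketch below uses a made-up four-row frame (the players and numbers are hypothetical, not taken from the dataset) to show how the games-played filter and a games-started filter can keep different rows.

```python
import pandas as pd

# Hypothetical player-seasons (made-up values, for illustration only).
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "G": [82, 15, 40, 60],      # games played
    "GS": [82, 0, 5, 30],       # games started
    "MP": [34.0, 8.0, 12.5, 9.0],
    "PTS": [25.0, 2.0, 5.0, 3.0],
})

# Filter on games played, as in the cell above; copy before adding a column.
kept = toy[(toy["G"] >= 20) & (toy["MP"] >= 10)].copy()
kept["PTS/MIN"] = kept["PTS"] / kept["MP"]

# Alternative filter on games started, as the prompt is worded.
started = toy[(toy["GS"] >= 20) & (toy["MP"] >= 10)].copy()

print(kept[["Player", "PTS/MIN"]])
print(started["Player"].tolist())
```

Player C passes the games-played filter but not the games-started one, so the choice of column changes which seasons reach the clustering step.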

Choose a subset of numeric columns that is interesting to you from an "archetypal" standpoint. 

We will choose the following basic statistics: **points, total rebounds, assists, steals, blocks**, and **turnovers**, but you should feel free to choose other reasonable feature sets if you like. Be careful not to include too many dimensions: k-means relies on Euclidean distance, which becomes less informative as the number of features grows (the curse of dimensionality).

In [21]:
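# feature set used here: box-score averages plus shooting percentages and scoring rate
cols = ["PTS", "TRB", "AST", "STL", "BLK", "2P%", "3P%", "PTS/MIN"]
# drop rows with a missing value in any selected column
stats = df[cols].dropna(axis=0)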
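
Before dropping rows, it can help to see which columns have missing values and how many rows `dropna` removes. The sketch below uses made-up numbers. It assumes that a percentage column such as `3P%` is missing when a player has no attempts, which is plausible for this kind of data but has not been verified against the file.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (made-up values); the first player has no 3-point attempts.
toy = pd.DataFrame({
    "PTS": [12.0, 8.5, 20.1],
    "3PA": [0.0, 2.1, 6.3],
    "3P%": [np.nan, 0.350, 0.401],
})

n_missing = toy.isna().sum()      # missing values per column
cleaned = toy.dropna(axis=0)      # drop any row with a missing value

print(n_missing)
print("rows dropped:", len(toy) - len(cleaned))
```

If many of the dropped rows share a trait (for example, big men who never shoot threes), the clusters will under-represent that group.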


Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [22]:
# scale stat columns
scaler = StandardScaler()
scaled = scaler.fit_transform(stats)
scaled_stats = pd.DataFrame(scaled, columns=stats.columns)
scaled_stats.head()



| | PTS | TRB | AST | STL | BLK | 2P% | 3P% | PTS/MIN |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.859882 | -1.112194 | -0.918195 | -0.627943 | -0.597966 | 0.051831 | 0.050799 | -1.095941 |
| 1 | -1.228958 | -0.982984 | -0.202719 | -0.879617 | -0.842925 | -2.195786 | 0.183915 | -1.292292 |
| 2 | 0.651573 | 2.33341 | -0.367829 | 1.8888 | 1.361711 | 1.604142 | -2.815635 | -0.017304 |
| 3 | -0.22718 | 1.385869 | -0.037609 | 0.378754 | 0.871792 | 1.474783 | -1.040753 | -0.286086 |
| 4 | 1.952128 | 2.204199 | 0.072464 | -0.627943 | 2.096589 | 0.504588 | -0.703526 | 1.754469 |
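
To make the standardization step concrete, the sketch below compares `StandardScaler` with a manual z-score on a tiny made-up array (the values are illustrative, not from the NBA data). The names `toy`, `toy_scaler`, and `manual` are introduced here only for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up per-game stats: columns are points and rebounds.
toy = np.array([
    [5.0, 1.0],
    [10.0, 4.0],
    [15.0, 7.0],
    [30.0, 12.0],
])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)

# Manual z-score: subtract each column's mean and divide by its
# population standard deviation (ddof=0), which StandardScaler also uses.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(toy_scaled, manual))
print(toy_scaled.mean(axis=0).round(6), toy_scaled.std(axis=0).round(6))
```

After scaling, every feature contributes on a comparable scale to the Euclidean distances that k-means minimizes, so points per game no longer dominate percentages.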


Run K-means clustering with K = 3 and print out the resulting centroids. When printing the centroids, transform the scaled centroids back into their corresponding unscaled values. What "archetypes" do you see?

In [26]:
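# k-means with k=3; n_init and random_state are fixed so reruns give the same clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(scaled_stats)

# map the centroids from standardized units back to the original per-game units
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
centroids_df = pd.DataFrame(centroids, columns=scaled_stats.columns)
centroids_df.style.background_gradient()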


| | PTS | TRB | AST | STL | BLK | 2P% | 3P% | PTS/MIN |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.861538 | 6.011154 | 1.290769 | 0.618462 | 0.900385 | 0.560169 | 0.164508 | 0.404831 |
| 1 | 7.331621 | 2.860274 | 1.690411 | 0.615868 | 0.262785 | 0.47683 | 0.34584 | 0.362069 |
| 2 | 17.639948 | 5.567885 | 4.253786 | 1.144125 | 0.549086 | 0.499436 | 0.35565 | 0.55633 |
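
The back-transformation of the centroids is just the inverse of the z-score: multiply by each feature's standard deviation and add its mean. The sketch below checks this on synthetic data (randomly generated, not the NBA data); the variable names are chosen for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 60 rows, two features on very different scales.
X = np.column_stack([rng.normal(10, 5, 60), rng.normal(0.4, 0.05, 60)])

sc = StandardScaler()
Xs = sc.fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xs)

# Two equivalent ways to express the centroids in original units.
via_method = sc.inverse_transform(km.cluster_centers_)
by_hand = km.cluster_centers_ * sc.scale_ + sc.mean_

print(np.allclose(via_method, by_hand))
```

Reading centroids in original units (points, rebounds, percentages) is what makes them interpretable as archetypes.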


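> The centroids suggest three rough archetypes: big men (cluster 0: about 6 rebounds and 0.9 blocks per game, 56% on 2-pointers, 16% on 3-pointers), low-usage perimeter players (cluster 1: the fewest points and rebounds, about 35% on 3-pointers), and high-usage scorers and playmakers (cluster 2: about 17.6 points and 4.3 assists per game). This only loosely matches a center/guard/forward split.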
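
One way to test an interpretation like this is to cross-tabulate cluster labels against listed positions. The sketch below uses a hypothetical six-row frame with made-up labels. In the notebook, the labels would presumably come from `kmeans.labels_` and the positions from `df.loc[stats.index, "Pos"]`, assuming the row order of `stats` is unchanged since the fit.

```python
import pandas as pd

# Hypothetical cluster assignments and positions (made up for illustration).
toy = pd.DataFrame({
    "Pos": ["C", "C", "PG", "SG", "PF", "PG"],
    "cluster": [0, 0, 1, 1, 0, 2],
})

# Rows are clusters, columns are positions, cells are counts.
table = pd.crosstab(toy["cluster"], toy["Pos"])
print(table)
```

A cluster dominated by one position supports a positional reading; a cluster that mixes positions points to a role-based archetype instead.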

Experiment with different values of K. Do any further interesting archetypes come out?

In [None]:
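# k-means with k=5; n_init and random_state are fixed so reruns give the same clusters
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans.fit(scaled_stats)

# map the centroids from standardized units back to the original per-game units
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
centroids_df = pd.DataFrame(centroids, columns=scaled_stats.columns)
centroids_df.style.background_gradient()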
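
Rather than trying values of K one at a time, a loop over several values can report inertia and silhouette score for each. The sketch below runs on synthetic blobs from `make_blobs` as a stand-in for the scaled stats. The range of K and the use of silhouette score are choices made for this example, not part of the assignment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled stats: 300 points around 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    # inertia: within-cluster sum of squares; silhouette: separation in [-1, 1]
    results[k] = (model.inertia_, silhouette_score(X_scaled, model.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
```

Inertia tends to fall as K grows, so it is usually read for an "elbow", while a higher silhouette score indicates better-separated clusters. On the real data, neither metric replaces inspecting the centroids for basketball sense.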


