## Day 47 Lecture 1 Assignment

In this assignment, we will apply k-means clustering to a dataset containing player-season statistics for NBA players from the past four years.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Each row in this dataset represents a player's per-game averages for a single season.

This dataset contains the following variables:

- Seas: season ('2019' = 2018-2019 season, '2018' = 2017-2018 season, etc.)
- Player: player name
- Pos: position
- Age: age
- Tm: team
- G: games played
- GS: games started
- MP: minutes played
- FG: field goals
- FGA: field goals attempted
- FG%: field goal percentage
- 3P: 3 pointers
- 3PA: 3 pointers attempted
- 3P%: 3 point percentage
- 2P: 2 pointers
- 2PA: 2 pointers attempted
- 2P%: 2 point percentage
- eFG%: effective field goal percentage
- FT: free throws
- FTA: free throws attempted
- FT%: free throw percentage
- ORB: offensive rebounds
- DRB: defensive rebounds
- TRB: total rebounds
- AST: assists
- STL: steals
- BLK: blocks
- TOV: turnovers
- PF: personal fouls
- PTS: points

Load the dataset.

In [2]:
# Load the player-season statistics from the hosted CSV file
data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/nba_player_seasons.csv')

The goal is to cluster these player-seasons to identify potential player "archetypes".

Begin by keeping only the player-seasons that meet both of the following criteria:
1. Started at least 20 games
2. Averaged at least 10 minutes per game

In [5]:
# Keep only player-seasons with at least 20 games started and at least 10 minutes per game
data = data[(data['GS'] >= 20) & (data['MP'] >= 10)]
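As a quick illustration of how the two conditions combine, here is a minimal sketch on a hypothetical toy table (the rows are made up and are not from the real dataset). A row is kept only when both comparisons are true.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "GS": [82, 5, 40, 20],
    "MP": [34.1, 22.0, 8.5, 10.0],
})

# Each comparison yields a boolean Series; & keeps a row only if both are True.
mask = (toy["GS"] >= 20) & (toy["MP"] >= 10)
kept = toy[mask]

print(kept["Player"].tolist())  # ['A', 'D']
```

Player B fails the games-started threshold and player C fails the minutes threshold, so only A and D remain.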

Choose a subset of numeric columns that is interesting to you from an "archetypal" standpoint. 

We will choose the following basic statistics: **points, total rebounds, assists, steals, blocks**, and **turnovers**, but you should feel free to choose other reasonable feature sets if you like. Be careful not to include too many dimensions: distances between points become less informative as the number of features grows (the curse of dimensionality).
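To give a rough feel for why too many dimensions can hurt a distance-based method like k-means, the sketch below uses uniformly random points (an assumption made purely for illustration, not the NBA data) and measures how much farther the farthest point is than the nearest one. The helper `distance_contrast` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_dims, n_points=500):
    """Relative gap between the farthest and nearest point from a random query."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# Compare a low-dimensional space with increasingly high-dimensional ones.
contrast = {n_dims: distance_contrast(n_dims) for n_dims in (2, 6, 30, 300)}

for n_dims, value in contrast.items():
    print(f"{n_dims:>3} dimensions: contrast = {value:.2f}")
```

On random data like this, the contrast tends to shrink as the dimension grows, which means "near" and "far" points become harder to tell apart. Real features are correlated, so the effect on actual player statistics may be milder.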

In [79]:
# Keep only the six per-game statistics used for clustering
data = data[['PTS', 'TRB', 'AST', 'STL', 'BLK', 'TOV']]

Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [80]:
# Standardize each feature to mean 0 and variance 1, then restore the column names
sc = StandardScaler()
data_sc = sc.fit_transform(data)
data_sc = pd.DataFrame(data_sc, columns=data.columns)
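As a sanity check on what StandardScaler does, the following sketch compares it with the manual formula `(x - mean) / std` on synthetic data (the two made-up columns only stand in for features on different scales, such as points and blocks). Note that StandardScaler uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative values only).
X = np.column_stack([
    rng.normal(15.0, 5.0, size=200),
    rng.normal(0.5, 0.3, size=200),
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Manual standardization with the population standard deviation (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, both columns contribute on a comparable scale to the Euclidean distances that k-means relies on.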

Run K-means clustering with K = 3 and print out the resulting centroids. When printing the centroids, transform the scaled centroids back into their corresponding unscaled values. What "archetypes" do you see?

In [100]:
# Fit k-means with K = 3 on the standardized features
kmeans = KMeans(n_clusters=3)
kmeans.fit(data_sc)

# Convert the centroids from standardized units back to the original per-game units
clusters = sc.inverse_transform(kmeans.cluster_centers_)

In [101]:
clusters = pd.DataFrame(clusters, columns=data_sc.columns)
clusters.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 19.49162 | 5.293855 | 5.934078 | 1.35419 | 0.486034 | 2.797765 |
| 1 | 13.735359 | 8.605525 | 1.914917 | 0.777348 | 1.230387 | 1.668508 |
| 2 | 10.505405 | 3.949775 | 2.063739 | 0.80473 | 0.374324 | 1.235135 |


Cluster 0 has the highest scoring average and also ranks highest in three of the five other columns (assists, steals, and turnovers). Cluster 1, the second-highest-scoring group, comes in first in total rebounds and blocks.
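The sketch below, on synthetic blobs rather than the NBA data, shows what the unscaling step does: `inverse_transform` multiplies each centroid by the scaler's `scale_` and adds back its `mean_`. It also counts the members of each cluster with `np.bincount`, which can help when judging how much weight to give an archetype. The `random_state` and `n_init` values are arbitrary choices made here for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in data: three blobs in two features.
X = np.vstack([
    rng.normal([10.0, 2.0], 1.0, size=(50, 2)),
    rng.normal([20.0, 6.0], 1.0, size=(50, 2)),
    rng.normal([14.0, 9.0], 1.0, size=(50, 2)),
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Centroids in original units, two equivalent ways.
centers_original = scaler.inverse_transform(km.cluster_centers_)
manual = km.cluster_centers_ * scaler.scale_ + scaler.mean_

# Number of points assigned to each cluster.
sizes = np.bincount(km.labels_, minlength=3)

print(np.allclose(centers_original, manual))
print(centers_original.round(2))
print(sizes)
```

Cluster labels are arbitrary: rerunning k-means without a fixed `random_state` can return the same groups under different index numbers.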

Experiment with different values of K. Do any further interesting archetypes come out?

In [96]:
# Try K = 2
kmeans = KMeans(n_clusters=2)
kmeans.fit(data_sc)
clusters = sc.inverse_transform(kmeans.cluster_centers_)
clusters = pd.DataFrame(clusters, columns=data_sc.columns)

clusters.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 11.003378 | 5.025338 | 1.96723 | 0.776858 | 0.573818 | 1.303209 |
| 1 | 19.459906 | 6.056132 | 5.474057 | 1.323113 | 0.642453 | 2.734434 |


With two clusters, the players who score more points also average more in every other column. I don't know much about basketball, but this seems to make sense: players who are on the court and handling the ball more have more chances to score, rebound, and assist, and also more chances to commit turnovers (which, unlike the other columns, count against a player).

In [97]:
# Try K = 4
kmeans = KMeans(n_clusters=4)
kmeans.fit(data_sc)
clusters = sc.inverse_transform(kmeans.cluster_centers_)
clusters = pd.DataFrame(clusters, columns=data_sc.columns)

clusters.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 8.298893 | 4.282288 | 1.390037 | 0.630996 | 0.482657 | 0.975646 |
| 1 | 13.636614 | 4.196063 | 3.003937 | 1.034646 | 0.347638 | 1.637795 |
| 2 | 14.96 | 9.123704 | 2.071852 | 0.773333 | 1.361481 | 1.82 |
| 3 | 20.188889 | 5.561806 | 6.289583 | 1.404167 | 0.506944 | 2.952083 |


In [98]:
# Try K = 5
kmeans = KMeans(n_clusters=5)
kmeans.fit(data_sc)
clusters = sc.inverse_transform(kmeans.cluster_centers_)
clusters = pd.DataFrame(clusters, columns=data_sc.columns)

clusters.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 11.485185 | 7.238272 | 1.660494 | 0.683951 | 0.926543 | 1.383333 |
| 1 | 19.707639 | 5.226389 | 6.286806 | 1.395833 | 0.454861 | 2.878472 |
| 2 | 18.730303 | 10.577273 | 2.60303 | 0.972727 | 1.692424 | 2.315152 |
| 3 | 13.446639 | 4.036975 | 2.964286 | 1.035294 | 0.334034 | 1.613025 |
| 4 | 7.755155 | 3.478351 | 1.409794 | 0.608247 | 0.356186 | 0.906701 |


With 4 and 5 clusters, the pattern is similar to the one with 3 clusters. With 5 clusters, the highest-scoring group (cluster 1) still has the top averages in three of the five remaining columns (assists, steals, and turnovers). Cluster 2, the second-highest-scoring group, comes in first in rebounds and blocks, just as with 3 clusters. The same holds with 4 clusters, where cluster 3 is the highest-scoring group and cluster 2 is the second highest.

If I'm interpreting this correctly, two clusters seem ideal, since one group has higher averages in every column and the other group has lower averages in every column.
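The comparison of K values here is made by eye. One common complement, which this assignment does not ask for, is to compare inertia (the within-cluster sum of squares) and the silhouette score across several values of K. The sketch below does this on synthetic blobs, since the point is only to show the mechanics. The data, `random_state`, and `n_init` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in data: three blobs in three features.
X = np.vstack([
    rng.normal([10.0, 4.0, 2.0], 1.0, size=(60, 3)),
    rng.normal([20.0, 5.0, 6.0], 1.0, size=(60, 3)),
    rng.normal([14.0, 9.0, 2.0], 1.0, size=(60, 3)),
])
X_scaled = StandardScaler().fit_transform(X)

# For each K, record inertia (lower is tighter) and silhouette (higher is better separated).
results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    results[k] = (km.inertia_, silhouette_score(X_scaled, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"K={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
```

Inertia generally keeps falling as K grows, so it is usually read by looking for an "elbow" rather than a minimum. The silhouette score ranges from -1 to 1, with higher values suggesting better-separated clusters. Neither number replaces checking whether the centroids make sense as archetypes.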