## Day 47 Lecture 1 Assignment

In this assignment, we will apply k-means clustering to a dataset containing player-season statistics for NBA players from the past four years.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Each row in this dataset represents one player's per-game averages for a single season.

This dataset contains the following variables:

- Seas: season ('2019' = 2018-2019 season, '2018' = 2017-2018 season, etc.)
- Player: player name
- Pos: position
- Age: age
- Tm: team
- G: games played
- GS: games started
- MP: minutes played
- FG: field goals
- FGA: field goals attempted
- FG%: field goal percentage
- 3P: 3 pointers
- 3PA: 3 pointers attempted
- 3P%: 3 point percentage
- 2P: 2 pointers
- 2PA: 2 pointers attempted
- 2P%: 2 point percentage
- eFG%: effective field goal percentage
- FT: free throws
- FTA: free throws attempted
- FT%: free throw percentage
- ORB: offensive rebounds
- DRB: defensive rebounds
- TRB: total rebounds
- AST: assists
- STL: steals
- BLK: blocks
- TOV: turnovers
- PF: personal fouls
- PTS: points

Load the dataset.

In [3]:
# answer goes here

bb_df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/nba_player_seasons.csv')

bb_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2141 entries, 0 to 2140
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Seas    2141 non-null   int64  
 1   Player  2141 non-null   object 
 2   Pos     2141 non-null   object 
 3   Age     2141 non-null   int64  
 4   Tm      2141 non-null   object 
 5   G       2141 non-null   int64  
 6   GS      2141 non-null   int64  
 7   MP      2141 non-null   float64
 8   FG      2141 non-null   float64
 9   FGA     2141 non-null   float64
 10  FG%     2131 non-null   float64
 11  3P      2141 non-null   float64
 12  3PA     2141 non-null   float64
 13  3P%     1967 non-null   float64
 14  2P      2141 non-null   float64
 15  2PA     2141 non-null   float64
 16  2P%     2110 non-null   float64
 17  eFG%    2131 non-null   float64
 18  FT      2141 non-null   float64
 19  FTA     2141 non-null   float64
 20  FT%     2037 non-null   float64
 21  ORB     2141 non-null   float64
 22  

The goal is to cluster these player-seasons to identify potential player "archetypes".

Begin by removing players whose season did not meet one of the following criteria:
1. Started at least 20 games
2. Averaged at least 10 minutes per game

In [4]:
# answer goes here

# Keep player-seasons that satisfy at least one criterion (| is element-wise OR).
new_df = bb_df.loc[(bb_df.MP >= 10) | (bb_df.GS >= 20)]
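The wording of the filter can be read two ways: keep a row if it meets either criterion, or only if it meets both. As a minimal sketch on a hypothetical toy frame (the rows below are made up for illustration and are not from the NBA dataset), the two readings differ only in the boolean operator:

```python
import pandas as pd

# Hypothetical toy rows, not taken from the NBA dataset.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "GS": [25, 0, 30, 5],
    "MP": [28.0, 12.5, 8.0, 6.0],
})

# Reading 1: keep a row when at least one criterion holds (element-wise OR).
either = toy.loc[(toy["MP"] >= 10) | (toy["GS"] >= 20)]

# Reading 2: keep a row only when both criteria hold (element-wise AND).
both = toy.loc[(toy["MP"] >= 10) & (toy["GS"] >= 20)]

print(either["Player"].tolist())
print(both["Player"].tolist())
```

The parentheses around each comparison are required, because `|` and `&` bind more tightly than `>=` in Python.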



Choose a subset of numeric columns that is interesting to you from an "archetypal" standpoint. 

We will choose the following basic statistics: **points, total rebounds, assists, steals, blocks**, and **turnovers**, but you should feel free to choose other reasonable feature sets if you like. Be careful not to include too many dimensions: distance-based methods such as k-means become less meaningful as the number of features grows (the curse of dimensionality).

In [5]:
# answer goes here


X = new_df[['PTS', 'TRB', 'AST', 'STL', 'BLK', 'TOV']]


Standardize the features in your dataset using scikit-learn's StandardScaler, which will set the mean of each feature to 0 and the variance to 1.

In [9]:
# answer goes here

scaler = StandardScaler()
X_scale = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
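To check what standardization does, here is a small sketch on a made-up two-column array (the numbers are illustrative only). After scaling, each column should have mean 0 and standard deviation 1, where the standard deviation is the population version (`ddof=0`) that `StandardScaler` uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales.
toy = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0], [40.0, 7.0]])

scaled = StandardScaler().fit_transform(toy)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)  # NumPy defaults to ddof=0, matching StandardScaler

print(col_means)
print(col_stds)
```

Without this step, a feature with a large range such as points would dominate the Euclidean distances that k-means minimizes.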



Run K-means clustering with K = 3 and print out the resulting centroids. When printing the centroids, transform the scaled centroids back into their corresponding unscaled values. What "archetypes" do you see?

In [10]:
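# answer goes here

# Fit K-means on the six standardized features, then attach the labels.
# No random_state is set, so cluster numbering can change between runs.
k_means = KMeans(n_clusters=3)
X_scale['cluster'] = k_means.fit_predict(X_scale)

# Centroids are still in standardized units at this point.
centers = pd.DataFrame(k_means.cluster_centers_, columns=X.columns)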


In [11]:
inverse = pd.DataFrame(scaler.inverse_transform(centers), columns=X.columns)
inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 6.770469 | 2.84863 | 1.497613 | 0.553492 | 0.284085 | 0.877011 |
| 1 | 12.535457 | 7.197507 | 1.783102 | 0.822715 | 0.981717 | 1.48892 |
| 2 | 17.829779 | 4.693382 | 5.204044 | 1.232353 | 0.417279 | 2.519853 |
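Transforming centroids back to original units amounts to undoing the standardization: multiply by each feature's standard deviation and add its mean. The sketch below uses synthetic data (not the NBA features) to show that `scaler.inverse_transform` agrees with that manual calculation, using the fitted `scale_` and `mean_` attributes of `StandardScaler`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data, for illustration only.
rng = np.random.default_rng(0)
toy = rng.normal(loc=[10.0, 5.0], scale=[4.0, 2.0], size=(200, 2))

scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy_scaled)

# Undo the scaling by hand: x = z * std + mean.
manual = km.cluster_centers_ * scaler.scale_ + scaler.mean_

# The same result via the scaler's own method.
via_api = scaler.inverse_transform(km.cluster_centers_)

print(np.allclose(manual, via_api))
```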


Experiment with different values of K. Do any further interesting archetypes come out?

In [12]:
# answer goes here

scaler = StandardScaler()
X_scale = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
k_means = KMeans(n_clusters=4)
X_scale['cluster'] = k_means.fit_predict(X_scale)
centers = pd.DataFrame(k_means.cluster_centers_, columns=X.columns)

inverse = pd.DataFrame(scaler.inverse_transform(centers), columns=X.columns)
inverse.style.background_gradient()



| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 19.837952 | 5.399398 | 6.106627 | 1.374699 | 0.493976 | 2.871687 |
| 1 | 5.76327 | 2.777725 | 1.106754 | 0.444431 | 0.298934 | 0.733412 |
| 2 | 12.897835 | 8.126407 | 1.792641 | 0.746753 | 1.222078 | 1.578355 |
| 3 | 11.272849 | 3.783556 | 2.659847 | 0.922371 | 0.330019 | 1.442639 |


In [13]:
scaler = StandardScaler()
X_scale = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
k_means = KMeans(n_clusters=6)
X_scale['cluster'] = k_means.fit_predict(X_scale)
centers = pd.DataFrame(k_means.cluster_centers_, columns=X.columns)

inverse = pd.DataFrame(scaler.inverse_transform(centers), columns=X.columns)
inverse.style.background_gradient()

| Cluster | PTS | TRB | AST | STL | BLK | TOV |
|---|---|---|---|---|---|---|
| 0 | 5.498403 | 2.329393 | 1.101438 | 0.434984 | 0.197284 | 0.698083 |
| 1 | 17.398795 | 9.968675 | 2.292771 | 0.883133 | 1.771084 | 2.149398 |
| 2 | 13.416087 | 6.60087 | 2.054348 | 0.951304 | 0.639565 | 1.529565 |
| 3 | 10.506199 | 2.921833 | 3.016981 | 0.867385 | 0.231536 | 1.45876 |
| 4 | 19.448718 | 5.061538 | 6.2 | 1.373718 | 0.433333 | 2.832051 |
| 5 | 7.143624 | 4.762752 | 1.054362 | 0.529195 | 0.731879 | 0.887919 |


In [14]:
# Tried K = 3, 4, and 6; the centroids changed substantially each time.
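Rather than refitting by hand for each K, one option is to loop over candidate values and record the inertia (within-cluster sum of squared distances) of each fit. The sketch below is only an illustration on synthetic blobs with assumed centers, not the NBA data; fixing `random_state` makes the runs repeatable, and `n_init=10` restarts each fit from several initializations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (assumed centers, illustration only).
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]],
    cluster_std=0.5,
    random_state=0,
)

# Fit one model per candidate K and store its inertia.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia generally keeps falling as K grows, so it is the point where the improvement levels off that suggests a reasonable K, not the minimum. The centroids at that K still need to be read for whether they describe sensible archetypes.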