The traditional five player positions incorrectly oversimplify the skill sets of NBA players. Simply pigeon-holing players into one of five positions does not accurately define a player’s specific skill set. Moreover, the misclassification of a player’s position may lead teams to waste resources on developing draft picks that do not fit their systems.

As the game has evolved, we need an effective way to designate NBA positions based not on physical traits such as height but on function, such as shooting and defense. A framework for modern NBA positions is important both for understanding how players have evolved and for effective roster construction. I set out to dissect the positional landscape in the NBA today.

My goal was to:
1. Use unsupervised clustering to delineate true functional positions of NBA players.
2. Uncover insight into the evolution of NBA player positions over time, and into relationships between similar players.

We will use the Kaggle dataset "NBA Players stats since 1950". The file Seasons_Stats.csv contains the season statistics of all players since 1950.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Load the season-by-season statistics
df1 = pd.read_csv('data/Seasons_Stats.csv')

First, we remove duplicated or empty rows and drop a couple of blank columns.

In [6]:
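# drop_duplicates returns a new frame, so df2 is independent of df1
df1_no_duplicates = df1.drop_duplicates()
df2 = df1_no_duplicates
# These deletions act on df1 only; df2 keeps 'blanl' and 'blank2'
del df1['blanl']
del df1['blank2']
# dropna returns a new frame; it is displayed here, not assigned back to df2
df2.dropna(how='all')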

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0
5,5,1950.0,Ed Bartels,F,24.0,NYK,2.0,,,,...,0.667,,,,0.0,,,,2.0,4.0
6,6,1950.0,Ralph Beard,G,22.0,INO,60.0,,,,...,0.762,,,,233.0,,,,132.0,895.0
7,7,1950.0,Gene Berce,G-F,23.0,TRI,3.0,,,,...,0.000,,,,2.0,,,,6.0,10.0
8,8,1950.0,Charlie Black,F-C,28.0,TOT,65.0,,,,...,0.651,,,,163.0,,,,273.0,661.0
9,9,1950.0,Charlie Black,F-C,28.0,FTW,36.0,,,,...,0.632,,,,75.0,,,,140.0,382.0
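
As a minimal sketch of the cleaning step, assuming a small toy frame rather than the real file: `drop_duplicates`, `dropna` and `drop` each return a new DataFrame, so the result has to be assigned for the cleanup to persist. The column name `blank` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the season stats: one duplicated row, one fully empty row,
# and one entirely blank column (hypothetical data, not taken from the dataset).
toy = pd.DataFrame({
    "Player": ["A", "A", "B", np.nan],
    "PTS": [10.0, 10.0, 7.0, np.nan],
    "blank": [np.nan, np.nan, np.nan, np.nan],
})

cleaned = (
    toy.drop_duplicates()          # returns a new frame; toy itself is unchanged
       .dropna(how="all")          # drop rows where every value is missing
       .drop(columns=["blank"])    # drop the empty column from the same frame
       .reset_index(drop=True)
)
print(cleaned)
```

Chaining the calls and assigning the result once keeps every step applied to the same frame.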


A second file, Players.csv, contains static information for each player, such as height and weight.

In [9]:
players = pd.read_csv('data/Players.csv', index_col=0)
players.head(10)

Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky
5,Gene Berce,180.0,79.0,Marquette University,1926.0,,
6,Charlie Black,196.0,90.0,University of Kansas,1921.0,Arco,Idaho
7,Nelson Bobb,183.0,77.0,Temple University,1924.0,Philadelphia,Pennsylvania
8,Jake Bornheimer,196.0,90.0,Muhlenberg College,1927.0,New Brunswick,New Jersey
9,Vince Boryla,196.0,95.0,University of Denver,1927.0,East Chicago,Indiana


1. Player names are unique (checked at the beginning), so we can merge the two dataframes on the 'Player' column.

2. Per-game statistics (e.g. points per game) and cumulative statistics (e.g. total points) can be misleading because they tend to inflate players with more playing time and lengthier careers. To limit outliers, I kept only seasons with more than 40 games played.

3. Keep only player-seasons with more than 200 minutes played (over an 82-game regular season, that is roughly 2.5 minutes per game). Players below that threshold are only anecdotal and would distort the analysis.

4. Remove the * sign that appears in some of the names.

5. For the stats that represent totals (others, such as TS%, are percentages), we take the values per 36 minutes, so that each player is judged on his characteristics rather than on the time he spent on the floor.
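
Step 1 relies on names being unique in the player table. A minimal sketch of how that could be verified before merging, using miniature tables (`seasons` and `bio` are hypothetical names; the rows reuse a few values shown in the outputs above):

```python
import pandas as pd

# Miniature versions of the two tables (illustrative subset only).
seasons = pd.DataFrame({
    "Player": ["Ed Bartels", "Ed Bartels", "Ralph Beard"],
    "Tm": ["DNN", "NYK", "INO"],
})
bio = pd.DataFrame({
    "Player": ["Ed Bartels", "Ralph Beard"],
    "height": [196.0, 178.0],
})

# Names must be unique on the biographical side for a name-based join.
names_unique = bio["Player"].is_unique

# validate="many_to_one" makes pandas raise a MergeError if a name repeats in bio.
merged = seasons.merge(bio, on="Player", how="left", validate="many_to_one")
print(names_unique, len(merged))
```

With `validate` set, a duplicated name fails loudly instead of silently multiplying season rows.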

In [21]:
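# Attach height and weight; missing values (including absent positions) become 0
data = pd.merge(df2, players[['Player', 'height', 'weight']], on='Player',
                how='left', sort=False).fillna(value=0)
# Keep rows with a known position, more than 200 minutes and more than 40 games
data = data[~(data['Pos']==0) & (data['MP'] > 200) & (data['G'] > 40)]
data.reset_index(inplace=True, drop=True)
# '*' is a regex metacharacter, so match it literally
data['Player'] = data['Player'].str.replace('*', '', regex=False)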

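# Columns rescaled to per-36-minute values. Note that PER, OBPM, DBPM and BPM are
# already rate metrics; they are rescaled along with the counting stats here.
totals = ['PER', 'OWS', 'DWS', 'WS', 'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA',
         'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']

for col in totals:
    data[col] = 36 * data[col] / data['MP']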
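
To make the per-36 arithmetic concrete, here is a small worked sketch on made-up numbers (the players `starter` and `reserve` and their totals are hypothetical). It applies the same formula, 36 * total / MP, to several columns at once.

```python
import pandas as pd

# Two hypothetical players: the reserve has half the minutes and half the points.
toy = pd.DataFrame({
    "Player": ["starter", "reserve"],
    "MP": [1800.0, 900.0],
    "PTS": [900.0, 450.0],
    "AST": [300.0, 100.0],
})

counting_cols = ["PTS", "AST"]
per36 = toy.copy()
# Multiply by 36, then divide each row by that row's minutes played.
per36[counting_cols] = toy[counting_cols].mul(36).div(toy["MP"], axis=0)
print(per36)
```

Both players come out at the same scoring rate per 36 minutes even though their raw totals differ by a factor of two, which is the point of the normalization.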

In [22]:
data.tail()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,height,weight
15683,24684,2017.0,Nick Young,SG,31.0,LAL,60.0,60.0,1556.0,0.326221,...,2.59126,3.169666,1.341902,0.856041,0.323907,0.832905,3.169666,18.300771,201.0,95.0
15684,24685,2017.0,Thaddeus Young,PF,28.0,IND,74.0,74.0,2237.0,0.239785,...,5.117568,7.225749,1.963344,1.8346,0.482789,1.544926,2.172553,13.099687,203.0,100.0
15685,24686,2017.0,Cody Zeller,PF,24.0,CHO,62.0,58.0,1725.0,0.348522,...,5.634783,8.452174,2.066087,1.293913,1.210435,1.356522,3.944348,13.335652,213.0,108.0
15686,24687,2017.0,Tyler Zeller,C,27.0,BOS,51.0,5.0,525.0,0.891429,...,5.554286,8.502857,2.88,0.48,1.44,1.371429,4.182857,12.205714,213.0,114.0
15687,24689,2017.0,Paul Zipser,SF,22.0,CHI,44.0,18.0,843.0,0.294662,...,4.697509,5.338078,1.537367,0.640569,0.683274,1.708185,3.330961,10.24911,203.0,97.0


Next, we cast these columns to integer type.

In [26]:
integerCol = ['Year', 'Age', 'G', 'GS']
for feature in integerCol:
    data[feature] = data[feature].astype(int)

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15688 entries, 0 to 15687
Data columns (total 55 columns):
Unnamed: 0    15688 non-null int64
Year          15688 non-null float64
Player        15688 non-null object
Pos           15688 non-null object
Age           15688 non-null float64
Tm            15688 non-null object
G             15688 non-null float64
GS            15688 non-null float64
MP            15688 non-null float64
PER           15688 non-null float64
TS%           15688 non-null float64
3PAr          15688 non-null float64
FTr           15688 non-null float64
ORB%          15688 non-null float64
DRB%          15688 non-null float64
TRB%          15688 non-null float64
AST%          15688 non-null float64
STL%          15688 non-null float64
BLK%          15688 non-null float64
TOV%          15688 non-null float64
USG%          15688 non-null float64
blanl         15688 non-null float64
OWS           15688 non-null float64
DWS           15688 non-null float64
WS      

In [27]:
pd.options.display.max_columns = None

In [28]:
data.head()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,blanl,OWS,DWS,WS,WS/48,blank2,OBPM,DBPM,BPM,VORP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,height,weight
0,488,1952,Paul Arizin,SF,23,PHW,66,0,2939.0,0.312351,0.546,0.0,0.579,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.181286,0.014699,0.195985,0.261,0.0,0.0,0.0,0.0,0.0,6.712487,14.968357,0.448,0.0,0.0,0.0,6.712487,14.968357,0.448,0.448,7.079959,8.660088,0.818,0.0,0.0,9.125553,2.082341,0.0,0.0,0.0,3.062266,20.504934,193.0,86.0
1,489,1952,Cliff Barker,SG,31,INO,44,0,494.0,0.787045,0.343,0.0,0.317,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.051012,0.0583,0.007287,0.008,0.0,0.0,0.0,0.0,0.0,3.497976,11.732794,0.298,0.0,0.0,0.0,3.497976,11.732794,0.298,0.298,2.186235,3.716599,0.588,0.0,0.0,5.902834,5.101215,0.0,0.0,0.0,4.080972,9.182186,188.0,83.0
2,490,1952,Don Barksdale,PF,28,BLB,62,0,2014.0,0.282423,0.409,0.0,0.427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003575,0.025025,0.026812,0.036,0.0,0.0,0.0,0.0,0.0,4.861966,14.3714,0.338,0.0,0.0,0.0,4.861966,14.3714,0.338,0.338,4.236346,6.131082,0.691,0.0,0.0,10.7428,2.448858,0.0,0.0,0.0,4.111221,13.960278,198.0,90.0
3,491,1952,Leo Barnhorst,SF,27,INO,66,0,2344.0,0.244198,0.419,0.0,0.208,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019966,0.053754,0.072184,0.097,0.0,0.0,0.0,0.0,0.0,5.360068,13.776451,0.389,0.0,0.0,0.0,5.360068,13.776451,0.389,0.389,1.87372,2.872014,0.652,0.0,0.0,6.604096,3.916382,0.0,0.0,0.0,3.010239,12.593857,193.0,86.0
4,494,1952,Nelson Bobb,PG,27,PHW,62,0,1192.0,0.329195,0.42,0.0,0.546,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021141,0.012081,0.036242,0.048,0.0,0.0,0.0,0.0,0.0,3.322148,9.241611,0.359,0.0,0.0,0.0,3.322148,9.241611,0.359,0.359,2.989933,5.043624,0.593,0.0,0.0,4.439597,5.073826,0.0,0.0,0.0,5.496644,9.634228,183.0,77.0


In [29]:
data.Pos.value_counts()

PF       3238
SF       3126
SG       3103
C        3038
PG       3011
SF-SG      23
C-PF       22
SG-SF      17
SG-PG      17
PG-SG      17
PF-C       16
PF-SF      15
SF-PF      13
F-C         9
G-F         6
G           3
C-F         3
SG-PF       3
F           2
F-G         2
C-SF        2
SF-PG       1
PG-SF       1
Name: Pos, dtype: int64

In [30]:
data.dtypes.value_counts()

float64    47
int64       5
object      3
dtype: int64