NBA teams are increasingly trotting out lineups with five players who can play and guard nearly any position. Traditional positions oversimplify what NBA players can actually do: slotting a player into one of five positions does not accurately describe his specific skill set. Moreover, misclassifying a player's position may lead teams to waste resources developing draft picks that do not fit their systems.

In light of these changes, we need an effective way to designate NBA positions based not on basic physical traits such as height and weight, but on function, such as shooting and defense. A framework for modern NBA positions is important both for understanding how players have evolved and for effective roster construction.

I will use the Kaggle dataset "NBA Players stats since 1950". The file Seasons_Stats.csv contains the season statistics of all players since 1950.

In [35]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Load the season-level statistics into a DataFrame.
df1 = pd.read_csv('data/Seasons_Stats.csv')

First, we remove duplicated or empty rows and drop a couple of blank columns.

In [36]:
# drop_duplicates and dropna return new frames, so the results are kept.
df2 = df1.drop_duplicates()
# The two blank columns, spelled as in the source file.
del df2['blanl']
del df2['blank2']
df2 = df2.dropna(how='all')
df2.head(5)

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,,0.467,,,,,,,,,-0.1,3.6,3.5,,,,,,144.0,516.0,0.279,,,,144.0,516.0,0.279,0.279,170.0,241.0,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,0.435,,0.387,,,,,,,,,1.6,0.6,2.2,,,,,,102.0,274.0,0.372,,,,102.0,274.0,0.372,0.372,75.0,106.0,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,,0.259,,,,,,,,,0.9,2.8,3.6,,,,,,174.0,499.0,0.349,,,,174.0,499.0,0.349,0.349,90.0,129.0,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,0.312,,0.395,,,,,,,,,-0.5,-0.1,-0.6,,,,,,22.0,86.0,0.256,,,,22.0,86.0,0.256,0.256,19.0,34.0,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,0.308,,0.378,,,,,,,,,-0.5,-0.1,-0.6,,,,,,21.0,82.0,0.256,,,,21.0,82.0,0.256,0.256,17.0,31.0,0.548,,,,20.0,,,,27.0,59.0
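
A detail worth keeping in mind during this cleaning step: by default, pandas methods such as drop_duplicates and dropna return a new DataFrame and leave the original untouched. The sketch below illustrates this on a small toy frame; the data and the names toy and cleaned are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): one duplicated row and one fully empty row.
toy = pd.DataFrame({
    'Player': ['A', 'A', np.nan, 'B'],
    'PTS': [10.0, 10.0, np.nan, 7.0],
})

# Calling the method without using its result leaves `toy` unchanged.
toy.dropna(how='all')
rows_before = len(toy)

# Keeping the returned frame is what actually applies the cleaning.
cleaned = toy.drop_duplicates().dropna(how='all')
rows_after = len(cleaned)

print(rows_before, rows_after)
```

Reassigning the result (or chaining the calls as above) avoids silently carrying the uncleaned rows forward.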


A second file, Players.csv, contains static information for each player, such as height and weight.

In [37]:
players = pd.read_csv('data/Players.csv', index_col=0)
players.head(5)

Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


1. Player names are unique (checked at the beginning), so we can merge the two dataframes on the 'Player' column.

In [38]:
# Left join keeps every season row; missing values are then filled with 0.
data = pd.merge(df2, players[['Player', 'height', 'weight']], on='Player',
                how='left').fillna(value=0)
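
Since the merge relies on names being unique in the player table, that assumption can also be checked in code. The following is a minimal sketch on toy frames (stats and bio are hypothetical stand-ins, not the real files), using the validate argument of pd.merge:

```python
import pandas as pd

# Toy stand-ins for the season stats and the player table (hypothetical data).
stats = pd.DataFrame({
    'Player': ['A', 'A', 'B'],
    'Year': [2016, 2017, 2017],
})
bio = pd.DataFrame({
    'Player': ['A', 'B'],
    'height': [201.0, 213.0],
})

# True only if no name appears twice in the lookup table.
names_unique = bio['Player'].is_unique

# validate='many_to_one' makes pandas raise a MergeError if the right-hand
# keys are not unique, so a name clash cannot silently duplicate season rows.
merged = pd.merge(stats, bio, on='Player', how='left', validate='many_to_one')

print(names_unique, len(merged) == len(stats))
```

With a left join and unique right-hand keys, the merged frame has exactly as many rows as the season table.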

2. Rate statistics (e.g. points per game) and cumulative statistics (e.g. total points) can be misleading because they tend to inflate players with lengthier careers. To deal with outliers, I keep only player-seasons with more than 40 games played and more than 400 minutes. With an 82-game regular season, 400 minutes is about 5 minutes per game (400 / 82 ≈ 4.9); players below that are only anecdotal and would distort the analysis. Rows without a position are dropped as well.

In [39]:
data = data[~(data['Pos']==0) & (data['MP'] > 400) & (data['G'] > 40)]
data.reset_index(inplace=True, drop=True)

3. Remove the * sign that appears in some of the names.

In [40]:
# regex=False treats '*' as a literal character rather than a regex quantifier.
data['Player'] = data['Player'].str.replace('*', '', regex=False)
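
Because * is a metacharacter in regular expressions, it can help to be explicit about how the pattern is interpreted. A small sketch with made-up names (the Series below is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical names; the trailing * mirrors the marker described in this step.
names = pd.Series(['Player One*', 'Player Two'])

# regex=False treats '*' as a plain character.
literal = names.str.replace('*', '', regex=False)

# Equivalent regex form: the asterisk has to be escaped.
escaped = names.str.replace(r'\*', '', regex=True)

print(literal.tolist())
```

Both forms give the same result; the literal form is arguably easier to read when no pattern matching is needed.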

4. For the stats that represent totals (others, such as TS%, are percentages), we take the values per 36 minutes. The goal is to judge every player according to his characteristics, not the time he was on the floor.

In [21]:
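# Columns rescaled to a per-36-minute basis.
# Note: PER, OBPM, DBPM and BPM are already rate metrics rather than totals,
# so rescaling them by minutes played is debatable.
totals = ['PER', 'OWS', 'DWS', 'WS', 'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA',
         'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']

for col in totals:
    data[col] = 36 * data[col] / data['MP']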
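
As a worked example of the per-36 conversion, take Paul Arizin's 1952 season as it appears in this notebook's data (1674 points in 2939 minutes):

```python
# Values from the Paul Arizin 1952 row of the season data.
total_points = 1674.0
minutes_played = 2939.0

# Per-36 rate: production scaled to a common 36 minutes of playing time.
points_per_36 = 36 * total_points / minutes_played

print(round(points_per_36, 2))
```

This comes out to roughly 20.5 points per 36 minutes, a figure that can be compared directly with players who logged far fewer minutes.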

In [22]:
data.tail()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,height,weight
15683,24684,2017.0,Nick Young,SG,31.0,LAL,60.0,60.0,1556.0,0.326221,...,2.59126,3.169666,1.341902,0.856041,0.323907,0.832905,3.169666,18.300771,201.0,95.0
15684,24685,2017.0,Thaddeus Young,PF,28.0,IND,74.0,74.0,2237.0,0.239785,...,5.117568,7.225749,1.963344,1.8346,0.482789,1.544926,2.172553,13.099687,203.0,100.0
15685,24686,2017.0,Cody Zeller,PF,24.0,CHO,62.0,58.0,1725.0,0.348522,...,5.634783,8.452174,2.066087,1.293913,1.210435,1.356522,3.944348,13.335652,213.0,108.0
15686,24687,2017.0,Tyler Zeller,C,27.0,BOS,51.0,5.0,525.0,0.891429,...,5.554286,8.502857,2.88,0.48,1.44,1.371429,4.182857,12.205714,213.0,114.0
15687,24689,2017.0,Paul Zipser,SF,22.0,CHI,44.0,18.0,843.0,0.294662,...,4.697509,5.338078,1.537367,0.640569,0.683274,1.708185,3.330961,10.24911,203.0,97.0


5. Cast the following columns to type 'int'.

In [41]:
integerCol = ['Year', 'Age', 'G', 'GS']
# Missing values were filled with 0 at the merge step, so the cast is safe.
data[integerCol] = data[integerCol].astype(int)
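
The cast only works when the columns contain no missing values, which is why filling NaN with 0 at the merge step matters. A minimal sketch with a hypothetical column:

```python
import numpy as np
import pandas as pd

# Hypothetical games-started column with one missing value.
gs = pd.Series([60.0, np.nan, 5.0])

# Casting directly fails because NaN has no integer representation.
try:
    gs.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Filling first (here with 0, as in the merge step) makes the cast valid.
gs_int = gs.fillna(0).astype(int)

print(cast_failed, gs_int.tolist())
```

One consequence of this choice is that a 0 in these columns can mean either a true zero or a value that was missing in the source file.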

In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15383 entries, 488 to 24689
Data columns (total 53 columns):
Unnamed: 0    15383 non-null int64
Year          15383 non-null int64
Player        15383 non-null object
Pos           15383 non-null object
Age           15383 non-null int64
Tm            15383 non-null object
G             15383 non-null int64
GS            15383 non-null int64
MP            15383 non-null float64
PER           15383 non-null float64
TS%           15383 non-null float64
3PAr          15383 non-null float64
FTr           15383 non-null float64
ORB%          15383 non-null float64
DRB%          15383 non-null float64
TRB%          15383 non-null float64
AST%          15383 non-null float64
STL%          15383 non-null float64
BLK%          15383 non-null float64
TOV%          15383 non-null float64
USG%          15383 non-null float64
OWS           15383 non-null float64
DWS           15383 non-null float64
WS            15383 non-null float64
WS/48         

In [43]:
pd.options.display.max_columns = None

In [44]:
data.head()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,height,weight
488,488,1952,Paul Arizin,SF,23,PHW,66,0,2939.0,25.5,0.546,0.0,0.579,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.8,1.2,16.0,0.261,0.0,0.0,0.0,0.0,548.0,1222.0,0.448,0.0,0.0,0.0,548.0,1222.0,0.448,0.448,578.0,707.0,0.818,0.0,0.0,745.0,170.0,0.0,0.0,0.0,250.0,1674.0,193.0,86.0
489,489,1952,Cliff Barker,SG,31,INO,44,0,494.0,10.8,0.343,0.0,0.317,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.7,0.8,0.1,0.008,0.0,0.0,0.0,0.0,48.0,161.0,0.298,0.0,0.0,0.0,48.0,161.0,0.298,0.298,30.0,51.0,0.588,0.0,0.0,81.0,70.0,0.0,0.0,0.0,56.0,126.0,188.0,83.0
490,490,1952,Don Barksdale,PF,28,BLB,62,0,2014.0,15.8,0.409,0.0,0.427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,1.4,1.5,0.036,0.0,0.0,0.0,0.0,272.0,804.0,0.338,0.0,0.0,0.0,272.0,804.0,0.338,0.338,237.0,343.0,0.691,0.0,0.0,601.0,137.0,0.0,0.0,0.0,230.0,781.0,198.0,90.0
491,491,1952,Leo Barnhorst,SF,27,INO,66,0,2344.0,15.9,0.419,0.0,0.208,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.3,3.5,4.7,0.097,0.0,0.0,0.0,0.0,349.0,897.0,0.389,0.0,0.0,0.0,349.0,897.0,0.389,0.389,122.0,187.0,0.652,0.0,0.0,430.0,255.0,0.0,0.0,0.0,196.0,820.0,193.0,86.0
494,494,1952,Nelson Bobb,PG,27,PHW,62,0,1192.0,10.9,0.42,0.0,0.546,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.7,0.4,1.2,0.048,0.0,0.0,0.0,0.0,110.0,306.0,0.359,0.0,0.0,0.0,110.0,306.0,0.359,0.359,99.0,167.0,0.593,0.0,0.0,147.0,168.0,0.0,0.0,0.0,182.0,319.0,183.0,77.0


In [29]:
data.Pos.value_counts()

PF       3238
SF       3126
SG       3103
C        3038
PG       3011
SF-SG      23
C-PF       22
SG-SF      17
SG-PG      17
PG-SG      17
PF-C       16
PF-SF      15
SF-PF      13
F-C         9
G-F         6
G           3
C-F         3
SG-PF       3
F           2
F-G         2
C-SF        2
SF-PG       1
PG-SF       1
Name: Pos, dtype: int64
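
The counts show a small number of hybrid labels such as SF-SG next to the five standard positions. If a single label per row is needed, one possible approach, shown here only as an illustration and not as a step this analysis prescribes, is to keep the first listed position:

```python
import pandas as pd

# A few labels of the kind listed in the value counts (toy sample).
pos = pd.Series(['SF-SG', 'C-PF', 'PG', 'G-F'])

# Keep only the part before the hyphen as the primary position.
primary = pos.str.split('-').str[0]

print(primary.tolist())
```

Labels without a hyphen pass through unchanged, and generic labels such as G or F would still need separate handling.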

In [30]:
data.dtypes.value_counts()

float64    47
int64       5
object      3
dtype: int64