NBA teams are increasingly trotting out lineups with five players who can play and guard nearly any position. Traditional positions oversimplify the skill sets of NBA players: plugging a player into one of five positions does not accurately describe that player's specific skill set. Moreover, misclassifying a player's position may lead teams to waste resources developing draft picks who do not fit their systems.

In light of these changes, we need an effective way to designate NBA positions based not on basic physical traits such as height and weight, but on function, such as shooting and defense. A framework for modern NBA positions is important both for understanding how players have evolved and for effective roster construction.

### Import the Required Python Packages and Methods

In [134]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import csv


### Finding and Evaluating Historical NBA Player Data

I will use the Kaggle dataset "NBA Players stats since 1950". The file Seasons_Stats.csv contains the season statistics of all players since 1950; its columns are described below:

    Year - Season
    Player - name
    Pos - Position
    Age - Age
    Tm - Team
    G - Games
    GS - Games Started
    MP - Minutes Played
    PER - Player Efficiency Rating
    TS% - True Shooting %
    3PAr - 3-Point Attempt Rate
    FTr - Free Throw Rate
    ORB% - Offensive Rebound Percentage
    DRB% - Defensive Rebound Percentage
    TRB% - Total Rebound Percentage
    AST% - Assist Percentage
    STL% - Steal Percentage
    BLK% - Block Percentage
    TOV% - Turnover Percentage
    USG% - Usage Percentage
    blanl
    OWS - Offensive Win Shares
    DWS - Defensive Win Shares
    WS - Win Shares
    WS/48 - Win Shares Per 48 Minutes
    blank2
    OBPM - Offensive Box Plus/Minus
    DBPM - Defensive Box Plus/Minus
    BPM - Box Plus/Minus
    VORP - Value Over Replacement
    FG - Field Goals
    FGA - Field Goal Attempts
    FG% - Field Goal Percentage
    3P - 3-Point Field Goals
    3PA - 3-Point Field Goal Attempts
    3P% - 3-Point Field Goal Percentage
    2P - 2-Point Field Goals
    2PA - 2-Point Field Goal Attempts
    2P% - 2-Point Field Goal Percentage
    eFG% - Effective Field Goal Percentage
    FT - Free Throws
    FTA - Free Throw Attempts
    FT% - Free Throw Percentage
    ORB - Offensive Rebounds
    DRB - Defensive Rebounds
    TRB - Total Rebounds
    AST - Assists
    STL - Steals
    BLK - Blocks
    TOV - Turnovers
    PF - Personal Fouls
    PTS - Points

In [135]:
# pandas opens and parses the file itself, so no separate csv.reader is needed
df1 = pd.read_csv('data/Seasons_Stats.csv')

First, we remove duplicated or empty rows and drop a couple of blank columns.

In [136]:
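# drop_duplicates, drop, and dropna each return a new DataFrame, so chain them and keep the result
df2 = (df1.drop_duplicates()
          .drop(['blanl', 'blank2'], axis=1)   # 'blanl' is the column's actual (misspelled) name
          .dropna(how='all'))                  # drops rows in which every column is empty
df2.head(5)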

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,AST%,STL%,BLK%,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,,0.467,,,,,,,,,-0.1,3.6,3.5,,,,,,144.0,516.0,0.279,,,,144.0,516.0,0.279,0.279,170.0,241.0,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,0.435,,0.387,,,,,,,,,1.6,0.6,2.2,,,,,,102.0,274.0,0.372,,,,102.0,274.0,0.372,0.372,75.0,106.0,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,,0.259,,,,,,,,,0.9,2.8,3.6,,,,,,174.0,499.0,0.349,,,,174.0,499.0,0.349,0.349,90.0,129.0,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,0.312,,0.395,,,,,,,,,-0.5,-0.1,-0.6,,,,,,22.0,86.0,0.256,,,,22.0,86.0,0.256,0.256,19.0,34.0,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,0.308,,0.378,,,,,,,,,-0.5,-0.1,-0.6,,,,,,21.0,82.0,0.256,,,,21.0,82.0,0.256,0.256,17.0,31.0,0.548,,,,20.0,,,,27.0,59.0


A second file, Players.csv, contains static information for each player, such as height and weight.

In [137]:
players = pd.read_csv('data/Players.csv', index_col=0)
players.head(5)

Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


From this file I keep three columns:
    
    Player - Player's full name (first and last)
    height - Height in cm
    weight - Weight in kg

1. The players have unique names (checked at the beginning), so we can merge the two dataframes on the 'Player' column.

In [138]:
# left join: keep every season row and attach height and weight where the name matches
data = pd.merge(df2, players[['Player', 'height', 'weight']], on='Player', how='left', sort=False)
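
The merge relies on each name appearing only once in the player table. A minimal sketch of how that assumption could be checked and enforced is shown here on toy data; the frames `seasons` and `bio` and their rows are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Toy stand-ins for the season stats and the player table (hypothetical rows).
seasons = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player B"],
    "Year": [2015, 2016, 2016],
    "PTS": [800, 950, 400],
})
bio = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "height": [198.0, 185.0],
    "weight": [95.0, 82.0],
})

# Each name should appear at most once on the right-hand side of the join.
names_are_unique = bio["Player"].is_unique

# validate="many_to_one" makes pandas raise MergeError if a right-hand key repeats.
merged = pd.merge(seasons, bio, on="Player", how="left", validate="many_to_one")

# A left join against unique keys must not change the number of season rows.
rows_preserved = len(merged) == len(seasons)
```

If names were duplicated in the player table, the left join would silently multiply season rows, which is why the row-count check is worth keeping next to the merge.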

2. Using rate statistics (e.g. points per game) or cumulative statistics (e.g. total points) can be misleading, because small samples distort the former and long careers inflate the latter. To deal with outliers, I keep only player-seasons with more than 40 games played. I also keep only players with more than 400 minutes in a season: over an 82-game regular season, that is only about 5 minutes per game. Players below that threshold are anecdotal and would distort the analysis.

In [139]:
data = data[~(data['Pos']==0) & (data['MP'] > 400) & (data['G'] > 40)]
data.reset_index(inplace=True, drop=True)
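
As a rough illustration of how the two thresholds behave, the sketch below applies them to a toy table. The rows are invented, and the names `min_games` and `min_minutes` are introduced here only for readability.

```python
import pandas as pd

# Hypothetical player-seasons chosen to sit on either side of each threshold.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "G": [82, 40, 60, 45],
    "MP": [2800.0, 900.0, 350.0, 401.0],
})

min_games, min_minutes = 40, 400

# Both comparisons are strict, so exactly 40 games or exactly 400 minutes is excluded.
enough_games = toy["G"] > min_games
enough_minutes = toy["MP"] > min_minutes
kept = toy[enough_games & enough_minutes].reset_index(drop=True)

# 400 minutes spread over an 82-game season, in minutes per game.
minutes_per_game_floor = min_minutes / 82
```

Player B is dropped for having exactly 40 games and player C for playing fewer than 400 minutes, so only A and D remain.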

In [140]:
data.Pos.value_counts()

PF       3171
SF       3066
SG       3044
PG       2972
C        2960
SF-SG      22
C-PF       21
SG-SF      17
SG-PG      17
PG-SG      17
PF-C       16
PF-SF      15
SF-PF      13
F-C         9
G-F         6
SG-PF       3
G           3
C-F         3
F-G         2
C-SF        2
F           2
PG-SF       1
SF-PG       1
Name: Pos, dtype: int64

In [141]:
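# keep the five standard positions; .copy() so later column assignments modify this frame, not a view
data = data[data.Pos.isin(['PF', 'PG', 'C', 'SG', 'SF'])].copy()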

3. Remove the * sign that appears in some of the names.

In [142]:
data['Player'] = data['Player'].str.replace('*', '', regex=False)  # treat '*' as a literal character, not a regex

4. Some stats are season totals (others, such as TS%, are already percentages). We convert the totals to per-36-minute values, so that every player is judged by his characteristics rather than by the time he spent on the floor.

In [143]:
totals = ['PER', 'OWS', 'DWS', 'WS', 'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA',
         'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']

for col in totals:
    data[col] = 36 * data[col] / data['MP']
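
To make the per-36 conversion concrete, here is a small sketch on invented numbers. The two players and the column subset `count_cols` are assumptions for illustration; the formula is the same `36 * total / MP` used in the loop.

```python
import pandas as pd

# Two hypothetical players with very different playing time.
toy = pd.DataFrame({
    "Player": ["Starter", "Reserve"],
    "MP": [2880.0, 720.0],
    "PTS": [1600.0, 400.0],
    "TRB": [640.0, 200.0],
})

count_cols = ["PTS", "TRB"]

# Scale every counting column by 36 / MP, row by row (axis=0 aligns on the row index).
per36 = toy.copy()
per36[count_cols] = toy[count_cols].mul(36 / toy["MP"], axis=0)
```

Although the starter scored four times as many points in total, both players come out at 20 points per 36 minutes, and the reserve actually rebounds at a higher rate (10 versus 8 per 36 minutes).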

In [144]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15213 entries, 0 to 15382
Data columns (total 53 columns):
Unnamed: 0    15213 non-null int64
Year          15213 non-null float64
Player        15213 non-null object
Pos           15213 non-null object
Age           15211 non-null float64
Tm            15213 non-null object
G             15213 non-null float64
GS            11070 non-null float64
MP            15213 non-null float64
PER           15212 non-null float64
TS%           15213 non-null float64
3PAr          11543 non-null float64
FTr           15213 non-null float64
ORB%          12868 non-null float64
DRB%          12868 non-null float64
TRB%          13409 non-null float64
AST%          14117 non-null float64
STL%          12868 non-null float64
BLK%          12868 non-null float64
TOV%          12039 non-null float64
USG%          12039 non-null float64
OWS           15212 non-null float64
DWS           15212 non-null float64
WS            15212 non-null float64
WS/48   

From the output above, 15213 rows remain (the index still runs up to 15382 because it was not reset after the position filter). The columns with the most missing data are the three-point columns:

    3P  (3-Point Field Goals)                  11543 non-null float64
    3PA (3-Point Field Goal Attempts)          11543 non-null float64
    3P% (3-Point Field Goal Percentage)        10478 non-null float64
    3PAr(3-Point Attempt Rate)                 11543 non-null float64
    
This is because the 3-point line was not introduced to the NBA until 1979. For the sake of classifying modern players, I assume that players before 1979 attempted no 3-pointers, so I fill the missing values in these four columns with 0.

In addition:

    GS  (Games Started)                        11070 non-null float64

Since the number of games started does not affect how a player's performance is measured, I will delete this column.

In [145]:
threePointsCol = ['3P','3PA','3P%','3PAr']
for feature in threePointsCol:
    data[feature] = data[feature].fillna(0)
del data['GS']
data[threePointsCol].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15213 entries, 0 to 15382
Data columns (total 4 columns):
3P      15213 non-null float64
3PA     15213 non-null float64
3P%     15213 non-null float64
3PAr    15213 non-null float64
dtypes: float64(4)
memory usage: 594.3 KB
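
The same fill can also be written as a single column-wise assignment. The sketch below shows this on toy data with invented values; it is an equivalent formulation, not a change to the approach above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one season with no three-point data, one with data.
toy = pd.DataFrame({
    "Year": [1975, 1985],
    "3P": [np.nan, 30.0],
    "3PA": [np.nan, 100.0],
    "3P%": [np.nan, 0.3],
    "PTS": [500.0, 600.0],
})

three_cols = ["3P", "3PA", "3P%"]

# Fill only the three-point columns; other columns keep their values untouched.
filled = toy.copy()
filled[three_cols] = toy[three_cols].fillna(0)

# Count the missing values left in the three-point columns.
remaining = int(filled[three_cols].isna().sum().sum())
```

Restricting `fillna` to a named column list keeps the zero-fill from leaking into columns where a missing value means "not recorded" rather than "none attempted".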


The next group of columns with a substantial amount of missing data is:
    
    ORB           12868 non-null float64
    DRB           12868 non-null float64
    STL           12868 non-null float64
    BLK           12868 non-null float64
    TOV           12039 non-null float64
    
And their associated percentages (plus USG%, which depends on TOV):
    
    ORB%          12868 non-null float64
    DRB%          12868 non-null float64
    STL%          12868 non-null float64
    BLK%          12868 non-null float64
    TOV%          12039 non-null float64
    USG%          12039 non-null float64
    
These columns have either 12868 or 12039 non-null rows. The common cause is that ORB, DRB, STL, BLK and their associated percentages are not included in this dataset before 1974, and TOV data is not present until 1978. These missing values cannot be filled in by assumption, yet the columns are too important to player classification to be removed, so all data before 1978 has to be discarded.

In [146]:
# keep 1978 onward; .copy() so the casts below modify this frame, not a view
data = data[data['Year'] >= 1978].copy()
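
One way to arrive at a cutoff year like this is to look up, for each column, the first season in which it has a value. The sketch below does that on a toy table; the values are invented and only mimic the pattern described above (steals from 1974, turnovers from 1978).

```python
import numpy as np
import pandas as pd

# Hypothetical seasons in which some stats only start being recorded later.
toy = pd.DataFrame({
    "Year": [1972, 1974, 1976, 1978, 1980],
    "PTS": [900.0, 850.0, 700.0, 950.0, 990.0],
    "STL": [np.nan, 60.0, 55.0, 70.0, 80.0],
    "TOV": [np.nan, np.nan, np.nan, 120.0, 110.0],
})

stat_cols = ["PTS", "STL", "TOV"]

# First season in which each column has a recorded value.
first_year = {col: int(toy.loc[toy[col].notna(), "Year"].min()) for col in stat_cols}

# The latest of those first seasons is the earliest year with every column present.
cutoff = max(first_year.values())
modern = toy[toy["Year"] >= cutoff].copy()
```

On the real data, a check of this kind would confirm whether 1978 is indeed the first season in which every required column is populated.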

5. Cast the Year, Age, and G columns to type 'int'.

In [147]:
integerCol = ['Year','Age', 'G']
for feature in integerCol:
    data[feature] = data[feature].astype(dtype ='int')

In [148]:
pd.options.display.max_columns = None
data.sample(5).transpose()

Unnamed: 0,13948,9222,12609,10036,5262
Unnamed: 0,22253,14326,19954,15712,7934
Year,2014,1999,2010,2002,1986
Player,Quincy Acy,Olden Polynice,J.J. Barea,Allan Houston,Gene Banks
Pos,SF,C,PG,SG,SF
Age,23,34,25,30,26
Tm,TOT,SEA,DAL,NYK,CHI
G,63,48,78,77,82
MP,847,1481,1546,2914,2139
PER,0.42928,0.286833,0.29806,0.187783,0.26087
TS%,0.52,0.461,0.526,0.54,0.559


In [149]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12039 entries, 3218 to 15382
Data columns (total 52 columns):
Unnamed: 0    12039 non-null int64
Year          12039 non-null int64
Player        12039 non-null object
Pos           12039 non-null object
Age           12039 non-null int64
Tm            12039 non-null object
G             12039 non-null int64
MP            12039 non-null float64
PER           12039 non-null float64
TS%           12039 non-null float64
3PAr          12039 non-null float64
FTr           12039 non-null float64
ORB%          12039 non-null float64
DRB%          12039 non-null float64
TRB%          12039 non-null float64
AST%          12039 non-null float64
STL%          12039 non-null float64
BLK%          12039 non-null float64
TOV%          12039 non-null float64
USG%          12039 non-null float64
OWS           12039 non-null float64
DWS           12039 non-null float64
WS            12039 non-null float64
WS/48         12039 non-null float64
OBPM       

In [150]:
data.to_csv('data/Seasons_Stats_cleansed.csv', sep='\t')
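
Because the file is written with a tab separator and with the index as its first column, reading it back needs matching arguments. A minimal round-trip sketch follows; it uses an in-memory buffer and invented rows in place of the real file.

```python
import io
import pandas as pd

# Hypothetical cleaned rows with a non-contiguous index, as left by the filtering steps.
toy = pd.DataFrame(
    {"Player": ["A", "B"], "Year": [2015, 2016], "TS%": [0.52, 0.461]},
    index=[3218, 3219],
)

# Write tab-separated text to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t")
buffer.seek(0)

# Read it back with the same separator, using the first column as the index.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
```

Omitting `sep="\t"` on the read would leave every row in one unparsed column, and omitting `index_col=0` would bring the old index back as an extra "Unnamed: 0" column.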