In [168]:
#import packages and set options
import pandas as pd
pd.options.mode.copy_on_write = True
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

### Feature Glossary
#### Provided by <a href="https://www.sports-reference.com/sharing.html?utm_source=direct&utm_medium=Share&utm_campaign=ShareTool">Basketball-Reference.com</a>: <a href="https://www.basketball-reference.com/leaders/trb_career.html?sr&utm_source=direct&utm_medium=Share&utm_campaign=ShareTool&utm_source=direct&utm_medium=Share&utm_campaign=ShareTool#nba">View Original Table</a><br>Generated 2/25/2024.

2P - 2-Point Field Goals

2P% - 2-Point Field Goal Percentage; the formula is 2P / 2PA.

2PA - 2-Point Field Goal Attempts

3P - 3-Point Field Goals (available since the 1979-80 season in the NBA)

3P% - 3-Point Field Goal Percentage (available since the 1979-80 season in the NBA); the formula is 3P / 3PA.

3PA - 3-Point Field Goal Attempts (available since the 1979-80 season in the NBA)

Age - Age; player age on February 1 of the given season.

AST - Assists

AST% - Assist Percentage (available since the 1964-65 season in the NBA); the formula is 100 * AST / (((MP / (Tm MP / 5)) * Tm FG) - FG). Assist percentage is an estimate of the percentage of teammate field goals a player assisted while he was on the floor.

Award Share - The formula is (award points) / (maximum number of award points). For example, in the 2002-03 MVP voting Tim Duncan had 962 points out of a possible 1190. His MVP award share is 962 / 1190 = 0.81.

BLK - Blocks (available since the 1973-74 season in the NBA)

BLK% - Block Percentage (available since the 1973-74 season in the NBA); the formula is 100 * (BLK * (Tm MP / 5)) / (MP * (Opp FGA - Opp 3PA)). Block percentage is an estimate of the percentage of opponent two-point field goal attempts blocked by the player while he was on the floor.

BPM - Box Plus/Minus (available since the 1973-74 season in the NBA); a box score estimate of the points per 100 possessions that a player contributed above a league-average player, translated to an average team. Please see the article About Box Plus/Minus (BPM) for more information.

DPOY - Defensive Player of the Year

DRB - Defensive Rebounds (available since the 1973-74 season in the NBA)

DRB% - Defensive Rebound Percentage (available since the 1970-71 season in the NBA); the formula is 100 * (DRB * (Tm MP / 5)) / (MP * (Tm DRB + Opp ORB)). Defensive rebound percentage is an estimate of the percentage of available defensive rebounds a player grabbed while he was on the floor.

DRtg - Defensive Rating (available since the 1973-74 season in the NBA); for players and teams it is points allowed per 100 possessions. This rating was developed by Dean Oliver, author of Basketball on Paper. Please see the article Calculating Individual Offensive and Defensive Ratings for more information.

DWS - Defensive Win Shares; please see the article Calculating Win Shares for more information.

eFG% - Effective Field Goal Percentage; the formula is (FG + 0.5 * 3P) / FGA. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal. For example, suppose Player A goes 4 for 10 with 2 threes, while Player B goes 5 for 10 with 0 threes. Each player would have 10 points from field goals, and thus would have the same effective field goal percentage (50%).

FG - Field Goals (includes both 2-point field goals and 3-point field goals)

FG% - Field Goal Percentage; the formula is FG / FGA.

FGA - Field Goal Attempts (includes both 2-point field goal attempts and 3-point field goal attempts)

FT - Free Throws

FT% - Free Throw Percentage; the formula is FT / FTA.

FTA - Free Throw Attempts

Four Factors - Dean Oliver's "Four Factors of Basketball Success"; please see the article Four Factors for more information.

G - Games

GB - Games Behind; the formula is ((first W - W) + (L - first L)) / 2, where first W and first L stand for wins and losses by the first place team, respectively.

GmSc - Game Score; the formula is PTS + 0.4 * FG - 0.7 * FGA - 0.4*(FTA - FT) + 0.7 * ORB + 0.3 * DRB + STL + 0.7 * AST + 0.7 * BLK - 0.4 * PF - TOV. Game Score was created by John Hollinger to give a rough measure of a player's productivity for a single game. The scale is similar to that of points scored, (40 is an outstanding performance, 10 is an average performance, etc.).

GS - Games Started (available since the 1982 season)

L - Losses

L Pyth - Pythagorean Losses; the formula is G - W Pyth.

Lg - League

MVP - Most Valuable Player

MP - Minutes Played (available since the 1951-52 season)

MOV - Margin of Victory; the formula is PTS - Opp PTS.

ORtg - Offensive Rating (available since the 1977-78 season in the NBA); for players it is points produced per 100 possessions, while for teams it is points scored per 100 possessions. This rating was developed by Dean Oliver, author of Basketball on Paper. Please see the article Calculating Individual Offensive and Defensive Ratings for more information.

Opp - Opponent

ORB - Offensive Rebounds (available since the 1973-74 season in the NBA)

ORB% - Offensive Rebound Percentage (available since the 1970-71 season in the NBA); the formula is 100 * (ORB * (Tm MP / 5)) / (MP * (Tm ORB + Opp DRB)). Offensive rebound percentage is an estimate of the percentage of available offensive rebounds a player grabbed while he was on the floor.

OWS - Offensive Win Shares; please see the article Calculating Win Shares for more information.

Pace - Pace Factor (available since the 1973-74 season in the NBA); the formula is 48 * ((Tm Poss + Opp Poss) / (2 * (Tm MP / 5))). Pace factor is an estimate of the number of possessions per 48 minutes by a team. (Note: 40 minutes is used in the calculation for the WNBA.)

PER - Player Efficiency Rating (available since the 1951-52 season); PER is a rating developed by ESPN.com columnist John Hollinger. In John's words, "The PER sums up all a player's positive accomplishments, subtracts the negative accomplishments, and returns a per-minute rating of a player's performance." Please see the article Calculating PER for more information.

Per 36 Minutes - A statistic (e.g., assists) divided by minutes played, multiplied by 36.

Per Game - A statistic (e.g., assists) divided by games.

PF - Personal Fouls

Poss - Possessions (available since the 1973-74 season in the NBA); the formula for teams is 0.5 * ((Tm FGA + 0.4 * Tm FTA - 1.07 * (Tm ORB / (Tm ORB + Opp DRB)) * (Tm FGA - Tm FG) + Tm TOV) + (Opp FGA + 0.4 * Opp FTA - 1.07 * (Opp ORB / (Opp ORB + Tm DRB)) * (Opp FGA - Opp FG) + Opp TOV)). This formula estimates possessions based on both the team's statistics and their opponent's statistics, then averages them to provide a more stable estimate. Please see the article Calculating Individual Offensive and Defensive Ratings for more information.

PProd - Points Produced; Dean Oliver's measure of offensive points produced. Please see the article Calculating Individual Offensive and Defensive Ratings for more information.

PTS - Points

ROY - Rookie of the Year

SMOY - Sixth Man of the Year

SOS - Strength of Schedule; a rating of strength of schedule. The rating is denominated in points above/below average, where zero is average. A positive number indicates a harder than average schedule. Doug Drinen, creator of Pro-Football-Reference.com, wrote a thorough explanation of this method.

SRS - Simple Rating System; a rating that takes into account average point differential and strength of schedule. The rating is denominated in points above/below average, where zero is average. Doug Drinen, creator of Pro-Football-Reference.com, wrote a thorough explanation of this method.

STL - Steals (available since the 1973-74 season in the NBA)

STL% - Steal Percentage (available since the 1973-74 season in the NBA); the formula is 100 * (STL * (Tm MP / 5)) / (MP * Opp Poss). Steal Percentage is an estimate of the percentage of opponent possessions that end with a steal by the player while he was on the floor.

Stops - Stops; Dean Oliver's measure of individual defensive stops. Please see the article Calculating Individual Offensive and Defensive Ratings for more information.

Tm - Team

TOV - Turnovers (available since the 1977-78 season in the NBA)

TOV% - Turnover Percentage (available since the 1977-78 season in the NBA); the formula is 100 * TOV / (FGA + 0.44 * FTA + TOV). Turnover percentage is an estimate of turnovers per 100 plays.

TRB - Total Rebounds (available since the 1950-51 season)

TRB% - Total Rebound Percentage (available since the 1970-71 season in the NBA); the formula is 100 * (TRB * (Tm MP / 5)) / (MP * (Tm TRB + Opp TRB)). Total rebound percentage is an estimate of the percentage of available rebounds a player grabbed while he was on the floor.

TS% - True Shooting Percentage; the formula is PTS / (2 * TSA). True shooting percentage is a measure of shooting efficiency that takes into account field goals, 3-point field goals, and free throws.

TSA - True Shooting Attempts; the formula is FGA + 0.44 * FTA.

Usg% - Usage Percentage (available since the 1977-78 season in the NBA); the formula is 100 * ((FGA + 0.44 * FTA + TOV) * (Tm MP / 5)) / (MP * (Tm FGA + 0.44 * Tm FTA + Tm TOV)). Usage percentage is an estimate of the percentage of team plays used by a player while he was on the floor.

VORP - Value Over Replacement Player (available since the 1973-74 season in the NBA); a box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season. Multiply by 2.70 to convert to wins over replacement. Please see the article About Box Plus/Minus (BPM) for more information.

W - Wins

W Pyth - Pythagorean Wins; the formula is G * (Tm PTS^14 / (Tm PTS^14 + Opp PTS^14)). The formula was obtained by fitting a logistic regression model with log(Tm PTS / Opp PTS) as the explanatory variable. Using this formula for all BAA, NBA, and ABA seasons, the root mean-square error (rmse) is 3.14 wins. Using an exponent of 16.5 (a common choice), the rmse is 3.48 wins. (Note: An exponent of 10 is used for the WNBA.)

W-L% - Won-Lost Percentage; the formula is W / (W + L).

WS - Win Shares; an estimate of the number of wins contributed by a player. Please see the article Calculating Win Shares for more information.

WS/48 - Win Shares Per 48 Minutes (available since the 1951-52 season in the NBA); an estimate of the number of wins contributed by the player per 48 minutes (league average is approximately 0.100). Please see the article Calculating Win Shares for more information.

Win Probability - The estimated probability that Team A will defeat Team B in a given matchup.

Year - Year that the season occurred. Since the NBA season is split over two calendar years, the year given is the last year for that season. For example, the year for the 1999-00 season would be 2000.
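
The shooting-efficiency entries in the glossary (eFG% and TS%) are simple enough to check by hand. Below is a minimal sketch of both formulas, using the Player A / Player B example from the eFG% entry and the Aaron Gordon 2024 row shown in the `head()` output of this notebook. The function names `efg_pct` and `ts_pct` are made up for illustration and are not part of the dataset or of pandas.

```python
def efg_pct(fg, fg3, fga):
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA."""
    return (fg + 0.5 * fg3) / fga


def ts_pct(pts, fga, fta):
    """True shooting percentage: PTS / (2 * TSA), with TSA = FGA + 0.44 * FTA."""
    tsa = fga + 0.44 * fta
    return pts / (2 * tsa)


# glossary example: Player A goes 4 for 10 with 2 threes, Player B goes 5 for 10 with 0 threes
player_a = efg_pct(fg=4, fg3=2, fga=10)
player_b = efg_pct(fg=5, fg3=0, fga=10)
print(player_a, player_b)  # 0.5 0.5

# Aaron Gordon, 2024 row: fg=266, x3p=29, fga=488, fta=181, pts=677
gordon_efg = efg_pct(fg=266, fg3=29, fga=488)
gordon_ts = ts_pct(pts=677, fga=488, fta=181)
print(round(gordon_efg, 3))  # 0.575, matching e_fg_percent in that row
print(round(gordon_ts, 3))
```

Recomputing a stored column such as `e_fg_percent` from its inputs is a cheap sanity check on the raw file before any cleaning.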

![image.png](attachment:image.png)
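
The two team-level formulas in the glossary, W Pyth and GB, are also plain arithmetic. This is a rough sketch only, assuming the exponent of 14 that Basketball-Reference uses for the NBA. The function names and the sample point totals are hypothetical. The ratio is written as `1 / (1 + (opp / team) ** exponent)`, which is algebraically the same as the glossary form but avoids very large intermediate powers.

```python
def pythagorean_wins(games, team_pts, opp_pts, exponent=14):
    """W Pyth = G * Tm PTS^e / (Tm PTS^e + Opp PTS^e), rearranged for numerical stability."""
    win_share = 1.0 / (1.0 + (opp_pts / team_pts) ** exponent)
    return games * win_share


def games_behind(first_w, first_l, w, l):
    """GB = ((first W - W) + (L - first L)) / 2."""
    return ((first_w - w) + (l - first_l)) / 2


# a team that scores exactly as much as it allows is expected to win half its games
even_team = pythagorean_wins(games=82, team_pts=100, opp_pts=100)
print(even_team)  # 41.0

# hypothetical season totals: 9000 points scored, 8800 allowed
good_team = pythagorean_wins(games=82, team_pts=9000, opp_pts=8800)
print(round(good_team, 1))

# hypothetical standings: leader is 50-32, this team is 45-37
gb = games_behind(first_w=50, first_l=32, w=45, l=37)
print(gb)  # 5.0
```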

In [169]:
#import data
player_totals = pd.read_csv('Player Totals.csv')

In [170]:
#investigate dataframe structure size
player_totals.shape

(31787, 35)

In [171]:
#investigate first 5 rows of the dataset
player_totals.head()

Unnamed: 0,seas_id,season,player_id,player,birth_year,pos,age,experience,lg,tm,g,gs,mp,fg,fga,fg_percent,x3p,x3pa,x3p_percent,x2p,x2pa,x2p_percent,e_fg_percent,ft,fta,ft_percent,orb,drb,trb,ast,stl,blk,tov,pf,pts
0,31136,2024,5025,A.J. Green,,SG,24.0,2,NBA,MIL,36,0.0,335.0,52,121,0.43,44.0,106.0,0.415,8,15,0.533,0.612,8,8,1.0,6.0,32.0,38.0,21,3.0,2.0,4.0,34,156
1,31137,2024,5026,A.J. Lawson,,SG,23.0,2,NBA,DAL,27,0.0,230.0,39,84,0.464,13.0,40.0,0.325,26,44,0.591,0.542,12,19,0.632,11.0,21.0,32.0,13,9.0,3.0,10.0,19,103
2,31138,2024,5027,AJ Griffin,,SF,20.0,2,NBA,ATL,18,0.0,132.0,13,45,0.289,9.0,33.0,0.273,4,12,0.333,0.389,2,2,1.0,2.0,12.0,14.0,4,1.0,1.0,6.0,6,37
3,31139,2024,4219,Aaron Gordon,,PF,28.0,10,NBA,DEN,49,49.0,1555.0,266,488,0.545,29.0,92.0,0.315,237,396,0.598,0.575,116,181,0.641,123.0,204.0,327.0,150,37.0,34.0,68.0,90,677
4,31140,2024,4582,Aaron Holiday,,PG,27.0,6,NBA,HOU,51,1.0,913.0,134,290,0.462,65.0,159.0,0.409,69,131,0.527,0.574,37,42,0.881,15.0,80.0,95.0,96,28.0,4.0,40.0,83,370


In [172]:
#there are a lot of features, set the output to show all rows vs. a summary
#investigate the feature datatypes.
player_totals.dtypes

seas_id           int64
season            int64
player_id         int64
player           object
birth_year      float64
pos              object
age             float64
experience        int64
lg               object
tm               object
g                 int64
gs              float64
mp              float64
fg                int64
fga               int64
fg_percent      float64
x3p             float64
x3pa            float64
x3p_percent     float64
x2p               int64
x2pa              int64
x2p_percent     float64
e_fg_percent    float64
ft                int64
fta               int64
ft_percent      float64
orb             float64
drb             float64
trb             float64
ast               int64
stl             float64
blk             float64
tov             float64
pf                int64
pts               int64
dtype: object

In [173]:
#investigate for n/a values
player_totals.isnull().sum()

seas_id             0
season              0
player_id           0
player              0
birth_year      28917
pos                 0
age                22
experience          0
lg                  0
tm                  0
g                   0
gs               8637
mp               1083
fg                  0
fga                 0
fg_percent        159
x3p              6352
x3pa             6352
x3p_percent     10537
x2p                 0
x2pa                0
x2p_percent       247
e_fg_percent      159
ft                  0
fta                 0
ft_percent       1299
orb              4657
drb              4657
trb               894
ast                 0
stl              5626
blk              5625
tov              5635
pf                  0
pts                 0
dtype: int64

In [174]:
#drop the birth_year column.  It is unnecessary for performance statistics.
player_totals.drop(columns = ['birth_year'], inplace = True)

In [175]:
#address the 22 n/a values in the age column
player_totals[player_totals['age'].isnull()]

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,gs,mp,fg,fga,fg_percent,x3p,x3pa,x3p_percent,x2p,x2pa,x2p_percent,e_fg_percent,ft,fta,ft_percent,orb,drb,trb,ast,stl,blk,tov,pf,pts
26473,5554,1973,1470,Pete Smith,PF,,1,ABA,SDA,5,,32.0,2,12,0.167,0.0,2.0,0.0,2,10,0.2,0.167,0,0,,3.0,5.0,8.0,1,,,5.0,5,4
27105,4391,1971,1253,Clarence Brookins,F,,1,ABA,FLO,8,,59.0,8,26,0.308,0.0,1.0,0.0,8,25,0.32,0.308,5,12,0.417,8.0,4.0,12.0,1,,,0.0,5,21
27259,4545,1971,1291,Jim Wilson,G,,1,ABA,PTC,6,,44.0,1,8,0.125,0.0,0.0,,1,8,0.125,0.125,4,6,0.667,1.0,5.0,6.0,8,,,5.0,3,6
27861,4293,1970,1231,Walter Byrd,PF,,1,ABA,MMF,22,,109.0,14,43,0.326,0.0,1.0,0.0,14,42,0.333,0.326,5,17,0.294,8.0,17.0,25.0,6,,,8.0,22,33
27872,4304,1970,1233,Wilbur Kirkland,F,,1,ABA,PTP,2,,27.0,3,7,0.429,0.0,0.0,,3,7,0.429,0.429,0,0,,1.0,10.0,11.0,1,,,2.0,5,6
27945,3550,1969,1081,Charles Parks,F,,1,ABA,DNR,2,,5.0,0,1,0.0,0.0,0.0,,0,1,0.0,0.0,0,0,,0.0,0.0,0.0,0,,,0.0,1,0
28322,3141,1968,904,Bill Allen,C,,1,ABA,ANA,38,,857.0,120,280,0.429,2.0,2.0,1.0,118,278,0.424,0.432,58,99,0.586,,,269.0,23,,,38.0,121,300
28351,3170,1968,922,Bobby Wilson,PF,,1,ABA,DLC,69,,1562.0,226,581,0.389,1.0,2.0,0.5,225,579,0.389,0.39,163,265,0.615,,,450.0,55,,,127.0,209,616
28377,3196,1968,939,Darrell Hardy,F,,1,ABA,HSM,17,,172.0,32,74,0.432,0.0,1.0,0.0,32,73,0.438,0.432,25,35,0.714,,,56.0,8,,,12.0,23,89
28387,3206,1968,945,Dexter Westbrook,F,,1,ABA,TOT,12,,127.0,19,39,0.487,0.0,0.0,,19,39,0.487,0.487,10,14,0.714,,,23.0,5,,,13.0,30,48


In [176]:
#looks like all 22 players missing an age value are in the ABA or BAA.
#neither league is included in the scope of this project.
#subset the DataFrame as 'nba_player_totals'
player_totals['lg'].value_counts()

lg
NBA    29567
ABA     1638
BAA      582
Name: count, dtype: int64

In [177]:
#subset the DataFrame
exclude_values = ['ABA', 'BAA']
nba_player_totals = player_totals[~player_totals['lg'].isin(exclude_values)]
nba_player_totals.shape

(29567, 34)

In [178]:
#investigate for remaining n/a values
nba_player_totals.isnull().sum()

seas_id            0
season             0
player_id          0
player             0
pos                0
age                0
experience         0
lg                 0
tm                 0
g                  0
gs              6417
mp               501
fg                 0
fga                0
fg_percent       142
x3p             5770
x3pa            5770
x3p_percent     9613
x2p                0
x2pa               0
x2p_percent      230
e_fg_percent     142
ft                 0
fta                0
ft_percent      1224
orb             3900
drb             3900
trb              312
ast                0
stl             3900
blk             3900
tov             5052
pf                 0
pts                0
dtype: int64

In [179]:
# games started (gs) was not a recorded statistic until the 1981-1982 season.
# the values in this column range from 0-83 and this study covers the 1951-2024 seasons.
# starting games can indicate a top player, but gs is not an essential performance measure.
# drop the column from the DataFrame
nba_player_totals.drop(columns = ['gs'], inplace = True)
nba_player_totals.shape

(29567, 33)

In [180]:
#address the 501 n/a values in the minutes played (mp) column
#minutes played was not recorded until the 1951-1952 season
nba_player_totals[nba_player_totals['mp'].isnull()]
#there is a lot of data from the 1949-1950 and 1950-1951 seasons that should be included
#keep mp in the DataFrame as is.  Remove from subsets as needed during EDA.

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,mp,fg,fga,fg_percent,x3p,x3pa,x3p_percent,x2p,x2pa,x2p_percent,e_fg_percent,ft,fta,ft_percent,orb,drb,trb,ast,stl,blk,tov,pf,pts
30704,895,1951,296,Al Cervi,PG,33.0,2,NBA,SYR,53,,132,346,0.382,,,,132,346,0.382,0.382,194,237,0.819,,,152.0,208,,,,180,458
30705,896,1951,417,Alan Sawyer,F,23.0,1,NBA,WSC,33,,87,235,0.370,,,,87,235,0.370,0.370,43,50,0.860,,,121.0,25,,,,75,217
30706,897,1951,299,Alex Groza,C,24.0,2,NBA,INO,66,,492,1046,0.470,,,,492,1046,0.470,0.470,445,566,0.786,,,709.0,156,,,,237,1429
30707,898,1951,300,Alex Hannum,PF,27.0,2,NBA,SYR,63,,182,494,0.368,,,,182,494,0.368,0.368,107,197,0.543,,,301.0,119,,,,271,471
30708,899,1951,204,Andy Duncan,F-C,28.0,3,NBA,BOS,14,,7,40,0.175,,,,7,40,0.175,0.175,15,22,0.682,,,30.0,8,,,,32,29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31200,890,1950,414,Warren Perkins,G-F,27.0,1,NBA,TRI,60,,128,422,0.303,,,,128,422,0.303,0.303,115,195,0.590,,,,114,,,,260,371
31201,891,1950,415,Wayne See,G,26.0,1,NBA,WAT,61,,113,303,0.373,,,,113,303,0.373,0.373,94,135,0.696,,,,143,,,,147,320
31202,892,1950,416,Whitey Von Nieda,G-F,27.0,1,NBA,TOT,59,,120,336,0.357,,,,120,336,0.357,0.357,73,115,0.635,,,,143,,,,127,313
31203,893,1950,416,Whitey Von Nieda,G-F,27.0,1,NBA,TRI,26,,40,116,0.345,,,,40,116,0.345,0.345,29,46,0.630,,,,36,,,,48,109


In [181]:
#address the 142 n/a values in the field goal % (fg_percent) column
nba_player_totals[nba_player_totals['fg_percent'].isnull()]

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,mp,fg,fga,fg_percent,x3p,x3pa,x3p_percent,x2p,x2pa,x2p_percent,e_fg_percent,ft,fta,ft_percent,orb,drb,trb,ast,stl,blk,tov,pf,pts
193,31329,2024,5138,Filip Petrušev,C,23.0,1,NBA,PHI,1,3.0,0,0,,0.0,0.0,,0,0,,,0,0,,0.0,1.0,1.0,0,0.0,0.0,0.0,0,0
279,31415,2024,5060,Jason Preston,PG,24.0,2,NBA,UTA,1,1.0,0,0,,0.0,0.0,,0,0,,,0,0,,1.0,0.0,1.0,0,0.0,0.0,1.0,0,0
281,31417,2024,4955,Javonte Smart,PG,24.0,2,NBA,PHI,1,1.0,0,0,,0.0,0.0,,0,0,,,0,0,,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0
378,31514,2024,4857,Kira Lewis Jr.,PG,22.0,4,NBA,TOR,1,2.0,0,0,,0.0,0.0,,0,0,,,0,0,,0.0,0.0,0.0,0,0.0,0.0,0.0,1,0
416,31552,2024,5169,Malcolm Cazalon,SG,22.0,1,NBA,DET,1,3.0,0,0,,0.0,0.0,,0,0,,,0,0,,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0
428,31564,2024,4759,Marques Bolden,C,25.0,3,NBA,MIL,2,3.0,0,0,,0.0,0.0,,0,0,,,0,0,,0.0,2.0,2.0,0,0.0,0.0,0.0,1,0
489,31625,2024,5180,Onuralp Bitim,SG,24.0,1,NBA,CHI,1,3.0,0,0,,0.0,0.0,,0,0,,,0,0,,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0
538,31674,2024,5094,Ron Harper Jr.,PF,23.0,2,NBA,TOR,1,4.0,0,0,,0.0,0.0,,0,0,,,0,0,,0.0,0.0,0.0,1,0.0,0.0,0.0,2,0
616,31752,2024,5195,Trey Jemison,C,24.0,1,NBA,WAS,2,1.0,0,0,,0.0,0.0,,0,0,,,0,0,,0.0,1.0,1.0,0,0.0,0.0,1.0,0,0
637,31773,2024,5019,Vit Krejci,PG,23.0,3,NBA,ATL,1,2.0,0,0,,0.0,0.0,,0,0,,,0,0,,0.0,0.0,0.0,0,0.0,0.0,0.0,0,0


In [182]:
# a quick scroll through the games played (g), minutes played (mp), personal fouls (pf), and points scored (pts) columns
# reveals that these players hardly played, and when they did it was either to foul a player or shoot free throws.
# they did not attempt a field goal, so fg% and efg% are undefined (0/0); set both to 0 for this analysis.
nba_player_totals.fillna({'fg_percent':0, 'e_fg_percent':0}, inplace = True)
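
Rather than relying on a quick scroll, the "no attempts" assumption can be checked directly before filling. This is a minimal sketch on a toy DataFrame that stands in for `nba_player_totals`; the values are illustrative only, and the same check could be pointed at the real frame.

```python
import numpy as np
import pandas as pd

# toy stand-in for nba_player_totals (made-up values)
toy = pd.DataFrame({
    'fg': [0, 3, 0],
    'fga': [0, 7, 2],
    'fg_percent': [np.nan, 0.429, 0.0],
    'e_fg_percent': [np.nan, 0.429, 0.0],
})

# every missing fg_percent should coincide with zero field goal attempts
missing = toy['fg_percent'].isna()
all_zero_attempts = bool((toy.loc[missing, 'fga'] == 0).all())
print(all_zero_attempts)  # True

# only fill once the assumption holds
if all_zero_attempts:
    toy = toy.fillna({'fg_percent': 0, 'e_fg_percent': 0})
print(toy['fg_percent'].isna().sum())  # 0
```

If the check ever printed `False`, some NA values would have a different cause and filling with 0 would hide it.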

In [183]:
nba_player_totals.isnull().sum()

seas_id            0
season             0
player_id          0
player             0
pos                0
age                0
experience         0
lg                 0
tm                 0
g                  0
mp               501
fg                 0
fga                0
fg_percent         0
x3p             5770
x3pa            5770
x3p_percent     9613
x2p                0
x2pa               0
x2p_percent      230
e_fg_percent       0
ft                 0
fta                0
ft_percent      1224
orb             3900
drb             3900
trb              312
ast                0
stl             3900
blk             3900
tov             5052
pf                 0
pts                0
dtype: int64

https://www.hoopsaddict.com/nba-3-point-line-history/#When_was_the_3_Point_Line_Introduced_in_The_NBA

In [187]:
#the three-point line was introduced for the 1979-1980 NBA season.
#this was also the rookie season for Larry Bird and Magic Johnson (https://www.hoopsaddict.com/nba-3-point-line-history/)
#the missing x3p and x3pa values are pre-1980 rows in the DataFrame.  These rows need to be kept and subset later in EDA.
#the x3p_percent column, however, is a simple calculation: x3p (makes) / x3pa (attempts).
#if a player has 0 three-point attempts the raw data shows x3p_percent as NA (0/0).  Treat it as 0, since he made 0 threes.

#Series.where keeps the calling Series where the condition is True and takes the second argument elsewhere:
#df['col'] = (value_if_true).where(condition, value_if_false)
#the remaining NA count should then equal the x3p and x3pa count of 5770, down from 9613. See above.

nba_player_totals['x3p_percent'] = (nba_player_totals['x3p']).where(nba_player_totals['x3p'] == 0, nba_player_totals['x3p_percent'])
nba_player_totals['x3p_percent'].isnull().sum()

5770
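
The argument order of `Series.where` is easy to get backwards, so here is a small sketch of what the line above does, on a made-up three-column frame. `where` keeps the calling Series wherever the condition is True and takes the second argument elsewhere. The `mask` version is shown only as an equivalent spelling, not as what this notebook ran.

```python
import numpy as np
import pandas as pd

# toy rows: no attempts, normal shooter, pre-1980 (untracked), attempts but no makes
toy = pd.DataFrame({
    'x3p': [0.0, 2.0, np.nan, 0.0],
    'x3pa': [0.0, 5.0, np.nan, 3.0],
    'x3p_percent': [np.nan, 0.4, np.nan, 0.0],
})

cond = toy['x3p'] == 0

# same pattern as the notebook: x3p (which is 0) where cond is True, x3p_percent elsewhere
via_where = toy['x3p'].where(cond, toy['x3p_percent'])

# equivalent spelling: start from x3p_percent and overwrite with 0 where cond is True
via_mask = toy['x3p_percent'].mask(cond, 0)

print(via_where.tolist())          # [0.0, 0.4, nan, 0.0]
print(via_where.equals(via_mask))  # True
print(via_where.isna().sum())      # 1, only the untracked row stays NA
```

Note that `NaN == 0` is False, which is why the untracked pre-1980 rows keep their NA value.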

In [188]:
#address the 230 n/a values in the x2p_percent column
#all of these values occur when the x2p column is equal to 0.
#these x columns are in place to separate 2-point and 3-point makes and attempts.  Each also counts toward fg and/or fga.
nba_player_totals['x2p_percent'] = (nba_player_totals['x2p']).where(nba_player_totals['x2p'] == 0, nba_player_totals['x2p_percent'])
nba_player_totals['x2p_percent'].isnull().sum()

0

In [189]:
nba_player_totals.isnull().sum()

seas_id            0
season             0
player_id          0
player             0
pos                0
age                0
experience         0
lg                 0
tm                 0
g                  0
mp               501
fg                 0
fga                0
fg_percent         0
x3p             5770
x3pa            5770
x3p_percent     5770
x2p                0
x2pa               0
x2p_percent        0
e_fg_percent       0
ft                 0
fta                0
ft_percent      1224
orb             3900
drb             3900
trb              312
ast                0
stl             3900
blk             3900
tov             5052
pf                 0
pts                0
dtype: int64

In [190]:
#address the 1224 n/a values in the free throw % (ft_percent) column
nba_player_totals[nba_player_totals['ft_percent'].isnull()]

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,mp,fg,fga,fg_percent,x3p,x3pa,x3p_percent,x2p,x2pa,x2p_percent,e_fg_percent,ft,fta,ft_percent,orb,drb,trb,ast,stl,blk,tov,pf,pts
17,31153,2024,5028,Alondes Williams,SG,24.0,2,NBA,MIA,1,6.0,0,1,0.000,0.0,1.0,0.000,0,0,0.000,0.000,0,0,,0.0,1.0,1.0,0,0.0,0.0,1.0,0,0
83,31219,2024,4700,Charlie Brown Jr.,SG,26.0,4,NBA,NYK,6,33.0,2,9,0.222,2.0,6.0,0.333,0,3,0.000,0.333,0,0,,1.0,1.0,2.0,0,0.0,2.0,2.0,5,6
120,31256,2024,4487,Daniel Theis,C,31.0,7,NBA,IND,1,8.0,1,4,0.250,0.0,1.0,0.000,1,3,0.333,0.250,0,0,,0.0,0.0,0.0,0,0.0,0.0,0.0,1,2
125,31261,2024,3867,Danny Green,SG,36.0,15,NBA,PHI,2,18.0,0,2,0.000,0.0,1.0,0.000,0,1,0.000,0.000,0,0,,0.0,2.0,2.0,1,1.0,0.0,0.0,1,0
133,31269,2024,5041,David Roddy,PF,22.0,2,NBA,PHO,1,10.0,2,3,0.667,1.0,2.0,0.500,1,1,1.000,0.833,0,0,,0.0,1.0,1.0,0,0.0,0.0,1.0,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30778,969,1951,430,Ed Beach,F,22.0,1,NBA,TRI,1,,0,3,0.000,,,,0,3,0.000,0.000,0,0,,,,0.0,1,,,,0,0
31077,767,1950,366,Jim Nolan,C,22.0,1,NBA,PHW,5,,4,21,0.190,,,,4,21,0.190,0.190,0,0,,,,,4,,,,14,8
31116,806,1950,113,Lee Knorek,C,28.0,4,NBA,BLB,1,,0,2,0.000,,,,0,2,0.000,0.000,0,0,,,,,0,,,,4,0
31144,834,1950,391,Murray Mitchell,C,26.0,1,NBA,AND,2,,1,3,0.333,,,,1,3,0.333,0.333,0,0,,,,,2,,,,1,2


In [191]:
#looks like the same easy fix.  the rows shown have 0 free throws on 0 attempts (0/0), so set ft_percent to 0.
nba_player_totals.fillna({'ft_percent':0}, inplace = True)
nba_player_totals['ft_percent'].isnull().sum()

0

In [192]:
#address the 312 n/a values in the total rebounds (trb) column. This stat wasn't tracked until the 1950-1951 season.
nba_player_totals[nba_player_totals['trb'].isnull()]

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,mp,fg,fga,fg_percent,x3p,x3pa,x3p_percent,x2p,x2pa,x2p_percent,e_fg_percent,ft,fta,ft_percent,orb,drb,trb,ast,stl,blk,tov,pf,pts
30893,583,1950,296,Al Cervi,PG,32.0,1,NBA,SYR,56,,143,431,0.332,,,,143,431,0.332,0.332,287,346,0.829,,,,264,,,,223,573
30894,584,1950,297,Al Guokas,F-G,24.0,1,NBA,TOT,57,,93,299,0.311,,,,93,299,0.311,0.311,28,50,0.56,,,,95,,,,143,214
30895,585,1950,297,Al Guokas,F-G,24.0,1,NBA,DNN,41,,86,271,0.317,,,,86,271,0.317,0.317,25,47,0.532,,,,85,,,,116,197
30896,586,1950,297,Al Guokas,F-G,24.0,1,NBA,PHW,16,,7,28,0.25,,,,7,28,0.25,0.25,3,3,1.0,,,,10,,,,27,17
30897,587,1950,298,Al Miksis,C,21.0,1,NBA,WAT,8,,5,21,0.238,,,,5,21,0.238,0.238,17,21,0.81,,,,4,,,,22,27
30898,588,1950,299,Alex Groza,C,23.0,1,NBA,INO,64,,521,1090,0.478,,,,521,1090,0.478,0.478,454,623,0.729,,,,162,,,,221,1496
30899,589,1950,300,Alex Hannum,PF,26.0,1,NBA,SYR,64,,177,488,0.363,,,,177,488,0.363,0.363,128,186,0.688,,,,129,,,,264,482
30900,590,1950,203,Andrew Levane,F-G,29.0,2,NBA,SYR,60,,139,418,0.333,,,,139,418,0.333,0.333,54,85,0.635,,,,156,,,,106,332
30901,591,1950,204,Andy Duncan,F-C,27.0,2,NBA,ROC,67,,125,289,0.433,,,,125,289,0.433,0.433,60,108,0.556,,,,42,,,,160,310
30902,592,1950,301,Andy O'Donnell,G,24.0,1,NBA,BLB,25,,38,108,0.352,,,,38,108,0.352,0.352,14,18,0.778,,,,17,,,,32,90


In [None]:
#as suspected they are all the 1949-1950 season rows.
#keep for now so as not to lose this season of play, but consider dropping after EDA and before modeling. 

In [None]:
# steals (stl), blocks (blk), offensive rebounds (orb), and defensive rebounds (drb) were all introduced
# as recorded stats in 1974 (the 1973-74 season).  Keep these N/A values as is.  Consider dropping only after EDA.
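
The "first tracked season" claims in these notes can be confirmed from the data instead of the glossary. A minimal sketch, assuming a frame with a `season` column and NA wherever a stat was not yet recorded; the toy values below are made up and the column list is only an example.

```python
import numpy as np
import pandas as pd

# toy stand-in: NA marks seasons where the stat was not yet recorded
toy = pd.DataFrame({
    'season': [1950, 1951, 1973, 1974, 1977, 1978],
    'trb': [np.nan, 152.0, 300.0, 310.0, 186.0, 200.0],
    'stl': [np.nan, np.nan, np.nan, 40.0, 20.0, 25.0],
    'tov': [np.nan, np.nan, np.nan, np.nan, np.nan, 90.0],
})

stat_cols = ['trb', 'stl', 'tov']
has_data = toy[stat_cols].notna()

# earliest season with at least one recorded value, per column
first_season = {
    col: int(toy.loc[has_data[col], 'season'].min())
    for col in stat_cols
}
print(first_season)  # {'trb': 1951, 'stl': 1974, 'tov': 1978}
```

On the real frame the same dictionary comprehension would show whether each block of NA values lines up with the season the stat was introduced.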

In [193]:
#address the 501 n/a values in the minutes played (mp) column. This stat wasn't tracked until the 1951-1952 season.
nba_player_totals[nba_player_totals['mp'].isnull()]

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,mp,fg,fga,fg_percent,x3p,x3pa,x3p_percent,x2p,x2pa,x2p_percent,e_fg_percent,ft,fta,ft_percent,orb,drb,trb,ast,stl,blk,tov,pf,pts
30704,895,1951,296,Al Cervi,PG,33.0,2,NBA,SYR,53,,132,346,0.382,,,,132,346,0.382,0.382,194,237,0.819,,,152.0,208,,,,180,458
30705,896,1951,417,Alan Sawyer,F,23.0,1,NBA,WSC,33,,87,235,0.370,,,,87,235,0.370,0.370,43,50,0.860,,,121.0,25,,,,75,217
30706,897,1951,299,Alex Groza,C,24.0,2,NBA,INO,66,,492,1046,0.470,,,,492,1046,0.470,0.470,445,566,0.786,,,709.0,156,,,,237,1429
30707,898,1951,300,Alex Hannum,PF,27.0,2,NBA,SYR,63,,182,494,0.368,,,,182,494,0.368,0.368,107,197,0.543,,,301.0,119,,,,271,471
30708,899,1951,204,Andy Duncan,F-C,28.0,3,NBA,BOS,14,,7,40,0.175,,,,7,40,0.175,0.175,15,22,0.682,,,30.0,8,,,,32,29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31200,890,1950,414,Warren Perkins,G-F,27.0,1,NBA,TRI,60,,128,422,0.303,,,,128,422,0.303,0.303,115,195,0.590,,,,114,,,,260,371
31201,891,1950,415,Wayne See,G,26.0,1,NBA,WAT,61,,113,303,0.373,,,,113,303,0.373,0.373,94,135,0.696,,,,143,,,,147,320
31202,892,1950,416,Whitey Von Nieda,G-F,27.0,1,NBA,TOT,59,,120,336,0.357,,,,120,336,0.357,0.357,73,115,0.635,,,,143,,,,127,313
31203,893,1950,416,Whitey Von Nieda,G-F,27.0,1,NBA,TRI,26,,40,116,0.345,,,,40,116,0.345,0.345,29,46,0.630,,,,36,,,,48,109


In [194]:
#as suspected all rows missing the mp value are from the 1949-1950 and 1950-1951 seasons.
#keep as is for now. Reconsider after EDA, possibly fill with the mean mp from the 1951-1952 season.
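
If the mean-fill idea is revisited after EDA, it could look roughly like the sketch below. This is only one possible approach on toy data, not a decision made in this notebook. The `mp_imputed` flag column is a hypothetical addition so that filled values can still be separated from recorded ones later.

```python
import numpy as np
import pandas as pd

# toy stand-in: mp is missing for the 1950 and 1951 seasons (made-up values)
toy = pd.DataFrame({
    'season': [1950, 1951, 1952, 1952],
    'g': [60, 53, 66, 40],
    'mp': [np.nan, np.nan, 2000.0, 1000.0],
})

# mean minutes played in the first season where mp was recorded
reference_mean = toy.loc[toy['season'] == 1952, 'mp'].mean()
print(reference_mean)  # 1500.0

# remember which rows were filled, then fill them
toy['mp_imputed'] = toy['mp'].isna()
toy['mp'] = toy['mp'].fillna(reference_mean)
print(toy['mp'].tolist())  # [1500.0, 1500.0, 2000.0, 1000.0]
```

A flat mean ignores how many games each player appeared in, so scaling by `g` might be a better fit; that is a judgment call for the EDA step.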

In [197]:
#address the 5052 n/a values in the turnover (tov) column. This stat wasn't tracked until the 1977-1978 season.
nba_player_totals[nba_player_totals['tov'].isnull()]

Unnamed: 0,seas_id,season,player_id,player,pos,age,experience,lg,tm,g,mp,fg,fga,fg_percent,x3p,x3pa,x3p_percent,x2p,x2pa,x2p_percent,e_fg_percent,ft,fta,ft_percent,orb,drb,trb,ast,stl,blk,tov,pf,pts
24515,6916,1977,1548,Aaron James,SF,24.0,3,NBA,NOJ,52,1059.0,238,486,0.490,,,,238,486,0.490,0.490,89,114,0.781,56.0,130.0,186.0,55,20.0,5.0,,127,565
24516,6917,1977,1692,Adrian Dantley,SF,21.0,1,NBA,BUF,77,2816.0,544,1046,0.520,,,,544,1046,0.520,0.520,476,582,0.818,251.0,336.0,587.0,144,91.0,15.0,,215,1564
24517,6918,1977,1549,Al Eberhard,SF,24.0,3,NBA,DET,68,1219.0,181,380,0.476,,,,181,380,0.476,0.476,109,138,0.790,76.0,145.0,221.0,50,45.0,15.0,,197,471
24518,6919,1977,1550,Al Skinner,SG,24.0,3,NBA,NYN,79,2256.0,382,887,0.431,,,,382,887,0.431,0.431,231,292,0.791,112.0,251.0,363.0,289,103.0,53.0,,279,995
24519,6920,1977,1693,Alex English,SF,23.0,1,NBA,MIL,60,648.0,132,277,0.477,,,,132,277,0.477,0.477,46,60,0.767,68.0,100.0,168.0,25,17.0,18.0,,78,310
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31200,890,1950,414,Warren Perkins,G-F,27.0,1,NBA,TRI,60,,128,422,0.303,,,,128,422,0.303,0.303,115,195,0.590,,,,114,,,,260,371
31201,891,1950,415,Wayne See,G,26.0,1,NBA,WAT,61,,113,303,0.373,,,,113,303,0.373,0.373,94,135,0.696,,,,143,,,,147,320
31202,892,1950,416,Whitey Von Nieda,G-F,27.0,1,NBA,TOT,59,,120,336,0.357,,,,120,336,0.357,0.357,73,115,0.635,,,,143,,,,127,313
31203,893,1950,416,Whitey Von Nieda,G-F,27.0,1,NBA,TRI,26,,40,116,0.345,,,,40,116,0.345,0.345,29,46,0.630,,,,36,,,,48,109


In [198]:
#as suspected the data is only missing when it wasn't a tracked statistic.  How important is this stat?
nba_player_totals['tov'].mean()
#this mean is per player-season row (not per career) and is higher than I expected.  Explore further in EDA step.

68.84935753620232
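
One thing to keep in mind when exploring this number: the `tm` column contains `TOT` rows (for example Al Guokas in 1950, whose 57 games under `TOT` equal his 41 + 16 games for DNN and PHW). Assuming `TOT` is the combined line for a player who changed teams mid-season, a plain mean counts those seasons more than once. A rough sketch on made-up data of keeping one row per player-season and looking at turnovers per game:

```python
import pandas as pd

# toy stand-in: player 1 was traded, so he has a TOT row plus one row per team
toy = pd.DataFrame({
    'player_id': [1, 1, 1, 2, 3],
    'season': [2000, 2000, 2000, 2000, 2000],
    'tm': ['TOT', 'AAA', 'BBB', 'AAA', 'BBB'],
    'g': [60, 40, 20, 80, 10],
    'tov': [90.0, 60.0, 30.0, 100.0, 5.0],
})

naive_mean = toy['tov'].mean()

# keep the TOT row when a player-season has one, otherwise keep the single team row
is_tot = toy['tm'].eq('TOT')
has_tot = is_tot.groupby([toy['player_id'], toy['season']]).transform('any')
one_row_per_season = toy[is_tot | ~has_tot]

dedup_mean = one_row_per_season['tov'].mean()
print(naive_mean, dedup_mean)  # 57.0 65.0

# turnovers per game is easier to compare across players than a season total
tov_per_game = one_row_per_season['tov'] / one_row_per_season['g']
print(tov_per_game.tolist())  # [1.5, 1.25, 0.5]
```

The direction of the bias depends on the data; the point is only that the row-level mean and the per-season mean are different quantities.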

In [199]:
nba_player_totals.isnull().sum()

seas_id            0
season             0
player_id          0
player             0
pos                0
age                0
experience         0
lg                 0
tm                 0
g                  0
mp               501
fg                 0
fga                0
fg_percent         0
x3p             5770
x3pa            5770
x3p_percent     5770
x2p                0
x2pa               0
x2p_percent        0
e_fg_percent       0
ft                 0
fta                0
ft_percent         0
orb             3900
drb             3900
trb              312
ast                0
stl             3900
blk             3900
tov             5052
pf                 0
pts                0
dtype: int64

In [None]:
#player data wrangled and ready for EDA.
#still need to wrangle the team data.