# Consolidate Player Data
**This module consolidates all of the data from the player using all of the gamelogs pulled using "Player_Data_Scraper"**.
It also has to correct for some issues with the NBA API. This is a running list since we don't know upfront where the errors are nor we need to correct for all of them. If the counting stats that matter align to the NBA.com totals and the missing data are not required we will ignore it.

Notably:
- Time was not tracked in minutes and seconds for every year. It also needs to be adjusted to convert to a time that works with the Pandas Time object.
- Some player log data is missing, but only for some stats. This was confirmed comparing API pulls of gamelogs (missing data) against Basketball reference and the NBA Gamelog by player. These are flagged and exported for a pull in separate file. It is later added back to see if that will fill the data gaps, which indicates there may be an issue with the original query for gamelogs, or, if the data are equivalent, it likely means there is an issue with the source data on NBA.com.

Here are some links to verify the issue with the source data for the April 6, 1985 game between the Kansas City Kings and Golden State Warriors.

[Basketball reference gamelog][c96d4877]

  [c96d4877]: https://www.basketball-reference.com/boxscores/198504060GSW.html "Basketball Reference"

[NBA Gamelog][5235feee]

  [5235feee]: https://stats.nba.com/game/0028400895/ "NBA Gamelog"

[Larry Smith individual gamelogs][5736c897]

  [5736c897]: https://stats.nba.com/player/78195/boxscores/?Season=1984-85&SeasonType=Regular%20Season "LarrySmith"


## Import the necessarily modules and get the working directory

In [746]:
import glob
import pandas as pd
import numpy as np
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', None)
import os
import datetime

In [None]:
#set the subdirectory to save the tables
savedir=os.getcwd()+'\\datatables\\gamelogs\\'
fileList = os.listdir(savedir)

## Open the individual Gamelogs

In [None]:
df=pd.DataFrame()
for i in fileList:
    print(i)
    df_read=pd.read_parquet(savedir+i)
    df=pd.concat([df,df_read],sort=False)




## Correct for the errors with minute data

In [None]:
'''This is to correct for some issues with the minutes
1) Set minutes = 0 if there is a DNP
2) Add 0 to seconds for the years where seconds were not tracked
3) Leave the rest as-is to convert to a time value. Since NBA does not track hours need to manually caculate them

This will return an object that can be converted in to a time object hours-min-sec
'''
def date_fix(x=0):
    x=str(x)
    if len(x) ==0:     
        return '00:00:00'
    elif ":" in x:
        sep= x.find(':')
        if int(x[:(sep)]) < 60:
            return (f'00:{str(int(x[:(sep)])).zfill(2)}:{x[(sep+1):]}')
        else:
            return (f'01:{str(int(x[:(sep)])-60).zfill(2)}:{x[(sep+1):]}')

#update the dataframe
df['MIN_AD'] = df['MIN'].apply(date_fix)
df["MIN_AD"].fillna('00:00:00', inplace = True) 

## Identify the games with missing player data
- Group by team so we are working with a smaler dataset. _We cannot filter on players since it's possible a player could have no rebounds for a game._
- Identify the games with missing information. Every game should have some offensive or defensive rebounds if there are rebounds. (The rebounds do seem to be tracked).
- Flag the players in those specific games and export to a file for lookup. We can use a csv here since the file will be small.

In [779]:
# We are only interested in counting stats at the team level
team_var='FGM FGA FG3M FG3A FTM FTA OREB DREB REB AST STL BLK TO PF PTS'.split(' ')
#get the aggregate totals for both team by game
df_team_err_check = df.groupby(['SEASON','GAME_ID'],sort=True)[team_var].sum().reset_index()

In [772]:
df_team_err_check.shape

(88290, 20)

In [780]:
#We select all the games from the df_team_err_check and count the number of rows where there are no values.
df_team_select=['SEASON','GAME_ID']+team_var
df_team_melt=(df_team_err_check.loc[:,df_team_select]).melt(value_vars=team_var)
(df_team_melt[df_team_melt.value ==0]).variable.value_counts()

FG3M    2309
OREB    1292
DREB    1239
TO       678
STL      549
FG3A     482
BLK      291
FGA        8
PF         6
AST        3
REB        1
Name: variable, dtype: int64

## Looking at rebounds
We can see that rebounds look off since only one game has no rebounds but many don't have any offensive or defensive rebounds.

Three pointers are also low however it's not unreasonable that in some games no three pointers were attempted. ShotTracker has already charted this. We will focus on off/def rebounds and see if filtering for that removes most of the suspicious results.

[3 Point Data from ShotTracker][dd2e3af9]

  [dd2e3af9]: https://shottracker.com/articles/the-3-point-revolution "ThreePt"

  Filtering for the games where there were no offensive or defensive rebounds appears to account for most of the errors.        

In [792]:
df_team_melt2=(df_team_err_check.loc[(df_team_err_check.OREB==0)|(df_team_err_check.DREB==0)].melt(value_vars=team_var))
(df_team_melt2[df_team_melt2.value ==0]).variable.value_counts()

OREB    1292
DREB    1239
TO       631
FG3M     509
STL      488
FG3A     355
BLK      243
FGA        6
PF         4
AST        3
REB        1
Name: variable, dtype: int64

## Checking to see if the missing values are related?
Since the GAME_ID is sorted by date "002" prefix + "YY" + "GAME#" we can see where the last zero rebound data falls, which is in 1984-85.
We can confirm this has something to do with data prior to 1984-85 by repeating the same exercise at the team level, but for data starting in the 1985-86 season.

In [796]:
df_team_err_check.loc[(df_team_err_check.OREB==0)|(df_team_err_check.DREB==0)].tail()

Unnamed: 0,SEASON,GAME_ID,FGM,FGA,FG3M,FG3A,FTM,FTA,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS
2819,1982-1983,28200934,103.0,189.0,1.0,5.0,45.0,61.0,0.0,0.0,83.0,68.0,11.0,3.0,31.0,59.0,252.0
2820,1982-1983,28200935,92.0,190.0,2.0,9.0,57.0,69.0,0.0,0.0,88.0,59.0,15.0,10.0,44.0,49.0,243.0
2975,1983-1984,28300147,92.0,194.0,2.0,7.0,59.0,69.0,0.0,0.0,98.0,49.0,16.0,13.0,7.0,54.0,245.0
3353,1983-1984,28300525,75.0,171.0,0.0,0.0,68.0,80.0,0.0,0.0,99.0,51.0,11.0,10.0,21.0,53.0,218.0
4665,1984-1985,28400895,98.0,180.0,2.0,0.0,68.0,77.0,0.0,0.0,74.0,55.0,10.0,8.0,13.0,54.0,266.0


## What does removing the data prior to 1984-85 do?

Removing this data seems to address the issues. Given teams averaged less than 5 three-pointers per game prior to 1990 it's not unreasonable that there were 17 games where no team attempted a three-point shot and 617 where they didn't make any.

At this point we can look at the blocks manually to see if there are data issues or the totals are correct. Once we filter the dataframe we can get its shape to get the number of rows (39,431).

In [917]:
season_filter=['1980-1981', '1981-1982', '1982-1983', '1983-1984', '1984-1985']
df_team_melt3=df_team_err_check.loc[~df_team_err_check.SEASON.isin(season_filter),:].melt(value_vars=team_var)
(df_team_melt3[df_team_melt3.value ==0]).variable.value_counts()

FG3M    613
FG3A     17
BLK      14
Name: variable, dtype: int64

In [939]:
# ~ is the same as "not in"
df_team_err_check.loc[(~df_team_err_check.SEASON.isin(season_filter)),:].shape

(39431, 17)

## Blocks data
Since there are only 14 games with no blocks, we can go the NBA website to check if the data are correct by changing the URL below. This confirms that there were no blocks when the Raptors Played the Pistons on October 26, 2016.
https://stats.nba.com/game/0021600007/

I also tried a couple more including 0029500085 and 0028800002. In both cases there were 0 blocks and it's reasonable that the other 11 games with no blocks are correct. (We could do some statistics here but given we are looking at 11 games out of 39431 games it's not really necessary.)

In [918]:
df_team_err_check.loc[(~df_team_err_check.SEASON.isin(season_filter))&(df_team_err_check.BLK==0),:]

Unnamed: 0,SEASON,GAME_ID,FGM,FGA,FG3M,FG3A,FTM,FTA,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS
6998,1987-1988,28700400,91.0,155.0,0.0,3.0,38.0,50.0,18.0,42.0,60.0,55.0,23.0,0.0,36.0,45.0,220.0
7543,1988-1989,28800002,82.0,170.0,7.0,12.0,58.0,65.0,31.0,46.0,77.0,47.0,22.0,0.0,30.0,52.0,229.0
7545,1988-1989,28800004,92.0,195.0,3.0,10.0,50.0,64.0,36.0,63.0,99.0,39.0,21.0,0.0,49.0,58.0,237.0
8162,1988-1989,28800621,77.0,178.0,5.0,17.0,41.0,56.0,34.0,61.0,95.0,52.0,19.0,0.0,32.0,44.0,200.0
8427,1988-1989,28800886,86.0,208.0,3.0,9.0,48.0,68.0,39.0,78.0,117.0,55.0,23.0,0.0,39.0,45.0,223.0
8490,1988-1989,28800949,81.0,165.0,3.0,12.0,53.0,62.0,28.0,52.0,80.0,43.0,21.0,0.0,36.0,53.0,218.0
12731,1992-1993,29200844,100.0,181.0,6.0,18.0,42.0,54.0,25.0,54.0,79.0,61.0,16.0,0.0,20.0,42.0,248.0
13366,1993-1994,29300372,60.0,142.0,3.0,19.0,55.0,76.0,29.0,53.0,82.0,29.0,13.0,0.0,32.0,60.0,178.0
15293,1995-1996,29500085,67.0,154.0,13.0,36.0,35.0,52.0,17.0,66.0,83.0,43.0,12.0,0.0,28.0,47.0,182.0
28490,2006-2007,20600585,74.0,150.0,9.0,38.0,44.0,64.0,21.0,55.0,76.0,42.0,7.0,0.0,27.0,48.0,201.0


## Three pointers made
- Three-pointers are a bit tricker. We know there are 613 games where no team attempted at least one three-point shot but didn't hit any after 1985-86.
- We also know that the rise in three-point shots is a more recent phenomenon. 
- We also know that give or take teams have historically hit ~35% of their three point shots, https://www.besttickets.com/blog/nba-shooting/.

We can do some things in Python for a sanity check.
 1. We assume three pointers made are independent events.
 2. On average slightly more than 1/3 of three-point shots are successful. Before we use that to see if totals look reasonable lets see if we can reduce the timeframe for analysis even--we know that teams have become a lot better at shooting three pointers over the years so it would be better to use a more precise calculation since we are only looking at 617 games.
 3. See over a 39431 games (games after 1985) what the odds are that given x number of shots none are made.

Setting this threshold to 15 drops the number to 5 games, which we can manually look up. (I confirmed games 0020300138, 0029100999, 0029100715, 0029100004, and 0029000408 all had 15 3PA with no makes.)

In [914]:
df_team_err_check.loc[(~df_team_err_check.SEASON.isin(season_filter))&(df_team_err_check.FG3M==0)&(df_team_err_check.FG3A>=15),:]

Unnamed: 0,SEASON,GAME_ID,FGM,FGA,FG3M,FG3A,FTM,FTA,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS
10081,1990-1991,29000408,85.0,188.0,0.0,16.0,41.0,54.0,31.0,61.0,92.0,44.0,25.0,9.0,40.0,52.0,211.0
10784,1991-1992,29100004,88.0,177.0,0.0,16.0,24.0,32.0,20.0,62.0,82.0,63.0,21.0,10.0,33.0,37.0,200.0
11495,1991-1992,29100715,80.0,172.0,0.0,15.0,31.0,44.0,22.0,70.0,92.0,54.0,19.0,6.0,31.0,33.0,191.0
11779,1991-1992,29100999,70.0,150.0,0.0,15.0,47.0,59.0,32.0,36.0,68.0,43.0,33.0,10.0,51.0,44.0,187.0
24394,2003-2004,20300138,62.0,154.0,0.0,21.0,52.0,80.0,20.0,68.0,88.0,32.0,20.0,7.0,36.0,60.0,176.0


## Reducing the scope of the search 
Before we further look at the data let's se if we can reduce the scope of the search bit further.

Looking at games with < 14 three pointers attempted there are only two games after the 1998-98 season where none went in 0020300989  0020200883.

Both of those checked out. **The remaining 608 games all took place between 1985-86 and 1998-99**. Let's focus on those years and pull out the specific games in a list.

In [926]:
df_team_err_check.loc[(~df_team_err_check.SEASON.isin(season_filter))&(df_team_err_check.FG3M==0)&(df_team_err_check.FG3A<=14),:]

Unnamed: 0,SEASON,GAME_ID,FGM,FGA,FG3M,FG3A,FTM,FTA,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS
4718,1985-1986,0028500005,81.00,170.00,0.00,5.00,29.00,44.00,27.00,57.00,84.00,46.00,19.00,12.00,33.00,42.00,191.00
4719,1985-1986,0028500006,78.00,168.00,0.00,2.00,56.00,81.00,31.00,59.00,90.00,36.00,12.00,9.00,38.00,66.00,212.00
4724,1985-1986,0028500011,86.00,171.00,0.00,2.00,48.00,72.00,27.00,54.00,81.00,53.00,20.00,16.00,35.00,55.00,220.00
4728,1985-1986,0028500015,104.00,224.00,0.00,6.00,51.00,71.00,38.00,70.00,108.00,57.00,19.00,7.00,40.00,66.00,259.00
4731,1985-1986,0028500018,96.00,186.00,0.00,4.00,43.00,61.00,29.00,58.00,87.00,41.00,26.00,16.00,39.00,48.00,235.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18277,1997-1998,0029700691,72.00,168.00,0.00,14.00,30.00,40.00,29.00,58.00,87.00,46.00,18.00,6.00,35.00,40.00,174.00
19043,1998-1999,0029800268,75.00,172.00,0.00,11.00,34.00,50.00,33.00,59.00,92.00,37.00,15.00,11.00,26.00,40.00,184.00
19107,1998-1999,0029800332,54.00,148.00,0.00,6.00,59.00,82.00,32.00,59.00,91.00,36.00,19.00,11.00,30.00,56.00,167.00
23950,2002-2003,0020200883,61.00,151.00,0.00,14.00,26.00,39.00,18.00,64.00,82.00,33.00,21.00,10.00,34.00,39.00,148.00


    ## 1986-87 to 1998-99 Data
    Filtering the data we are   now looking at 13845 games.

In [936]:
# df.SEASON.unique() we can use this to get a list of the seasons and paste the ones we want into a new filter
season_filter_3p=['1986-1987', '1987-1988', '1988-1989', '1989-1990',
       '1990-1991', '1991-1992', '1992-1993', '1993-1994', '1994-1995',
       '1995-1996', '1996-1997', '1997-1998', '1998-1999']

In [996]:
#The games where no three point shots were made
game_list_3p=list((df_team_err_check.loc[(df_team_err_check.SEASON.isin(season_filter_3p))&(df_team_err_check.FG3M==0)&(df_team_err_check.FG3A<=14),:]).GAME_ID)

In [997]:
df_3pt=df.loc[(df.SEASON.isin(season_filter_3p)),:]
df_3pt.GAME_ID.nunique() #number of games in the revised sample

13845

In [1005]:
df_3pt_team=df_3pt.groupby(['SEASON','TEAM_ABBREVIATION'])['FG3M','FG3A'].sum().reset_index()
df_3pt_team['3P%']=df_3pt_team.FG3M/df_3pt_team.FG3A
df_3pt_team.head()

Unnamed: 0,SEASON,TEAM_ABBREVIATION,FG3M,FG3A,3P%
0,1986-1987,ATL,135.0,425.0,0.32
1,1986-1987,BOS,207.0,565.0,0.37
2,1986-1987,CHI,78.0,297.0,0.26
3,1986-1987,CLE,81.0,338.0,0.24
4,1986-1987,DAL,232.0,653.0,0.36


In [1009]:
#need to filter out because each team has multiple players
(df.loc[df.GAME_ID.isin(game_list_3p),['SEASON','TEAM_ABBREVIATION','GAME_ID']]).groupby(['SEASON','TEAM_ABBREVIATION']).count().reset_index()

Unnamed: 0,SEASON,TEAM_ABBREVIATION,GAME_ID
0,1986-1987,ATL,98
1,1986-1987,BOS,20
2,1986-1987,CHI,160
3,1986-1987,CLE,258
4,1986-1987,DAL,28
...,...,...,...
155,1997-1998,SAC,12
156,1998-1999,ATL,12
157,1998-1999,BOS,11
158,1998-1999,MIN,12


In [910]:
import numpy as np
def threept(shot_att):
    sample=39431
    shots_m=0
    for i in range(0,sample):
        x=(np.random.randint(0,high=3, size=shot_att))
        if sum(x) > 0:
            shots_m+=1
    return(sample-shots_m)
            
for i in range(10):
    threept(5)

173

In [875]:
'''determine which games have no rebounds and use for a lookup
1) Assume errors where offensive rebounds or defensive rebounds = 0 for a game
2) Flag the games where this occurs
3) Flag the playerID
4) Save to a CSV for lookup
'''

revise_logs=(df_team.loc[(df_team['OREB']+df_team['DREB'])==0,['GAME_ID']]).values.tolist()
revise_logs=[item for sublist in revise_logs for item in sublist]
df_error=df.loc[df['GAME_ID'].isin(revise_logs),['GAME_ID','SEASON','PLAYER_ID']]

error_path=savedir+'\\Error_Lookup\\'
df_error.to_csv(f'{error_path}missing_data.csv')

## Merge the missing data
- We need to reopen the missing data then do a merge to update the gamelogs.
- Once that is complete we need to confirm that this did not introduce any additional errors before overwriting the data. If the new data corrects for missing data in the original pull without damaging any of the existing data then we can safely add it in and move to calculating the team stats, notably estimating possessions.

In [650]:
df_missing_pl=pd.read_parquet(f'{error_path}df_missing_players.gzip')
#Make the columns uppercase to for the join
df_missing_pl.columns=map(lambda x: str(x).upper(), df_missing_pl.columns)
df_missing_pl.head()

Unnamed: 0,SEASON_ID,PLAYER_ID,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
0,21982,76385,28200942,"APR 17, 1983",PHL @ BOS,L,26,3,8.0,0.38,0.0,0.0,0.0,4,4,1.0,1.0,3.0,4.0,7.0,1.0,1.0,0.0,0.0,10,,0
1,21982,76385,28200930,"APR 15, 1983",PHL @ NJN,W,26,5,7.0,0.71,0.0,0.0,0.0,2,2,1.0,,,3.0,4.0,,0.0,,1.0,12,,0
2,21982,76385,28200909,"APR 13, 1983",PHL vs. WAS,L,23,6,10.0,0.6,0.0,0.0,0.0,0,0,0.0,1.0,1.0,2.0,2.0,2.0,0.0,1.0,0.0,12,,0
3,21982,76385,28200907,"APR 12, 1983",PHL @ ATL,L,31,3,7.0,0.43,0.0,0.0,0.0,4,4,1.0,1.0,1.0,2.0,6.0,1.0,0.0,2.0,2.0,10,,0
4,21982,76385,28200895,"APR 10, 1983",PHL vs. NYK,W,39,6,7.0,0.86,0.0,0.0,0.0,2,2,1.0,0.0,2.0,2.0,9.0,6.0,1.0,2.0,1.0,14,,0


In [659]:
df_merge_cols= ['SEASON_ID','PLAYER_ID', 'GAME_ID', 'GAME_DATE', 'MATCHUP', 'WL',
       'MIN', 'FGM', 'FGA', 'FG3M', 'FG3A','FTM', 'FTA', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF',
       'PTS', 'PLUS_MINUS']

In [674]:
df_merge = pd.merge(df,df_missing_pl[df_merge_cols], left_on=['GAME_ID','PLAYER_ID'], right_on=['GAME_ID','PLAYER_ID'],suffixes=('', '_new'),how='left')

## Review the rebound column

In [705]:
df_merge[((df_merge.OREB+df_merge.DREB)==0)&((df_merge.OREB_new+df_merge.DREB_new)>0)]


Unnamed: 0,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,PLAYER_ID,PLAYER_NAME,START_POSITION,COMMENT,MIN,FGM,FGA,FG3M,FG3A,FTM,FTA,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS,GAME_DATE_EST,SEASON,HM_AW,OPP,WINNER,PLAYER_WIN_OR_LOSE,OFF_RATING,DEF_RATING,NET_RATING,AST_PCT,AST_TOV,AST_RATIO,OREB_PCT,DREB_PCT,REB_PCT,TM_TOV_PCT,EFG_PCT,TS_PCT,USG_PCT,PACE,PACE_PER40,POSS,PIE,MIN_AD,SEASON_ID,GAME_DATE,MATCHUP,WL,MIN_new,FGM_new,FGA_new,FG3M_new,FG3A_new,FTM_new,FTA_new,OREB_new,DREB_new,REB_new,AST_new,STL_new,BLK_new,TOV,PF_new,PTS_new,PLUS_MINUS_new


## Results
After comparing this data to the original dataset we confirmed that both queries return incomplete results for rebounds before 1985.
- We know Basketball Reference has the information however since scraping it won't add that much value we will look at interpolating it in the other file.
- If we can't find a good way of interpolating the data then we will just ignore those records since they won't have an impact on the future machine learning exercises. It will, however, impact what we can visualize going back in time.
- We can be more confident our original query wasn't the issue since we tried something different and returned the same results. Given the number of items we looked at that indicates a likely issue with the source data.

For now we will ignore the data since we can infer that the both the player logs and game logs are pulling from the same source. Basketball reference has either manually corrected for the data error or interpolated it. We will look at whether there is a way of reasonably interpolating the data since we do have the denominator (total rebounds).

In [None]:
df_team = df.groupby(['SEASON','GAME_ID','HM_AW','TEAM_ID','TEAM_ABBREVIATION'],sort=True)[team_var].sum().reset_index()

## Add Possessions Data at the team level
Possession data can be estimated using the Basketball Reference formula at the team level. At the player level we will use the individual advanced stats provided by NBA.com for now.
> possessions = 0.5 * ((Tm FGA + 0.4 * Tm FTA – 1.07 * (Tm ORB / (Tm ORB + Opp DRB))
        * (Tm FGA – Tm FG) + Tm TOV)
       + (Opp FGA + 0.4 * Opp FTA – 1.07 
          * (Opp ORB / (Opp ORB + Tm DRB))
          * (Opp FGA – Opp FG) + Opp TOV))

In [None]:
'''
This builds the possessions for the home and away teams. The team dataframes are sorted by season--game--home/away.
There are two rows per game (Home/Away)
Away is always first. We can use the dataframe keys to build a dictionary with the possession count for the game for each team.
We use the variables 'a' and 'h' as selectors for the Away and Home rows, then calculate as per the Basketball reference formula.
'''


from IPython.display import display
game_list=[i for i in df_team.GAME_ID]

for i in game_list[:16:2]:
    df_pos=(df_team[(df_team.GAME_ID==i)]) #temporary df for away data
    display(df_pos)
    #set rows for aggregation
    a=df_pos.loc[df_pos.index[0]]
    h=df_pos.loc[df_pos.index[1]]
    h_pos=0.5 * ((h.FGA + 0.4 * h.FTA - 1.07 * (h.OREB / (h.OREB + a.DREB)) * (h.FGA - h.FGM) + h.TO) + (a.FGA + 0.4 * a.FTA - 1.07 * (a.OREB / (a.OREB + h.DREB)) * (a.FGA - a.FGM) + a.TO))
    a_pos=0.5 * ((a.FGA + 0.4 * a.FTA - 1.07 * (a.OREB / (a.OREB + h.DREB)) * (a.FGA - a.FGM) + a.TO) + (h.FGA + 0.4 * h.FTA - 1.07 * (h.OREB / (h.OREB + a.DREB)) * (h.FGA - h.FGM) + h.TO))
    print(a_pos,h_pos,i)

## Add player images
Player images can be identified using the path url, https://stats.nba.com/player/ + the player ID.

Here is a screenshot of LeBron James which we can identify via the Chrome Console.
https://github.com/tkpca/NBA_Data/blob/master/Images/Finding%20Player%20Photos.png

In [1016]:
df['HEADSHOT']= 'https://stats.nba.com/player/' + df.PLAYER_ID.astype(str)
df.head()

Unnamed: 0_level_0,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,PLAYER_ID,PLAYER_NAME,START_POSITION,COMMENT,MIN,FGM,FGA,FG3M,FG3A,FTM,FTA,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS,GAME_DATE_EST,SEASON,HM_AW,OPP,WINNER,PLAYER_WIN_OR_LOSE,OFF_RATING,DEF_RATING,NET_RATING,AST_PCT,AST_TOV,AST_RATIO,OREB_PCT,DREB_PCT,REB_PCT,TM_TOV_PCT,EFG_PCT,TS_PCT,USG_PCT,PACE,PACE_PER40,POSS,PIE,MIN_AD,HEADSHOT
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1
0,28000001,1610612756,PHX,76011,Alvan Adams,,,29,10.0,15.0,0.0,0.0,5.0,6.0,1.0,7.0,8.0,6.0,3.0,0.0,3.0,2.0,25.0,,1980-10-10T00:00:00,1980-1981,H,GSW,H,1,,,,,,,,,,,,,,,,,,00:00:00,https://stats.nba.com/player/76011
1,28000001,1610612756,PHX,78095,Alvin Scott,,,12,0.0,0.0,0.0,0.0,2.0,4.0,0.0,3.0,3.0,2.0,0.0,1.0,0.0,1.0,2.0,,1980-10-10T00:00:00,1980-1981,H,GSW,H,1,,,,,,,,,,,,,,,,,,00:00:00,https://stats.nba.com/player/78095
2,28000001,1610612756,PHX,77141,Dennis Johnson,,,31,10.0,17.0,0.0,0.0,8.0,9.0,4.0,1.0,5.0,3.0,2.0,0.0,2.0,3.0,28.0,,1980-10-10T00:00:00,1980-1981,H,GSW,H,1,,,,,,,,,,,,,,,,,,00:00:00,https://stats.nba.com/player/77141
3,28000001,1610612756,PHX,76436,Jeff Cook,,,33,2.0,3.0,0.0,0.0,0.0,0.0,1.0,6.0,7.0,1.0,1.0,0.0,3.0,3.0,4.0,,1980-10-10T00:00:00,1980-1981,H,GSW,H,1,,,,,,,,,,,,,,,,,,00:00:00,https://stats.nba.com/player/76436
4,28000001,1610612756,PHX,77308,Joel Kramer,,,16,2.0,2.0,0.0,0.0,2.0,4.0,1.0,1.0,2.0,2.0,0.0,0.0,0.0,1.0,6.0,,1980-10-10T00:00:00,1980-1981,H,GSW,H,1,,,,,,,,,,,,,,,,,,00:00:00,https://stats.nba.com/player/77308


In [1029]:
#days rest
df_rest=df[['SEASON','PLAYER_ID','GAME_ID','GAME_DATE_EST']]
df_rest.head()

Unnamed: 0_level_0,SEASON,PLAYER_ID,GAME_ID,GAME_DATE_EST
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,1980-1981,76011,28000001,1980-10-10T00:00:00
1,1980-1981,78095,28000001,1980-10-10T00:00:00
2,1980-1981,77141,28000001,1980-10-10T00:00:00
3,1980-1981,76436,28000001,1980-10-10T00:00:00
4,1980-1981,77308,28000001,1980-10-10T00:00:00


## Save the updated file
Save the updated file with the data corrected. It can now be used for further visualization and be processed for machine learning.

In [None]:
df.to_parquet(f'{savedir}df_all_gamelogs.gzip',compression='gzip')
