## NBA shot logs

The data used in this notebook was taken from https://www.kaggle.com/dansbecker/nba-shot-logs <br/> The purpose of this notebook is to explore and understand the data, looking for correlations and other relevant information in it.

By Eloi Ugo Pattaro <br/> contact information: eloi@pattaro.com.br <br/> last update: 16/09/2016

In [None]:
## importing libraries and functions ##

import pandas as pd
import numpy as np
import copy
import matplotlib.pyplot as plt
import statsmodels.api as sm

import pylab

from scipy.optimize import curve_fit

%matplotlib inline

### Loading Data

Once downloaded and unzipped, we load the data.

In [None]:
DF=pd.read_csv('../input/shot_logs.csv')

#DF=copy.deepcopy(pd.read_csv('shot_logs.csv'))

### Understanding DF features

Here we print the basic information about the DataFrame: <br/> its size, the feature labels, a sample value of each feature and its type. <br/> Since it's a small DataFrame (only 21 features), it's easy to visualize.

Afterwards we check for any missing data (NaN) within our DataFrame.

We also check (in hindsight) for senseless/wrong data (negative time) and remove it.
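As a minimal sketch of the checks described above, the share of missing and of negative values can be read directly from boolean masks. The frame below is invented for illustration; only the column names `SHOT_CLOCK` and `TOUCH_TIME` come from the data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shot log; the numbers are invented for illustration.
toy = pd.DataFrame({
    'SHOT_CLOCK': [10.8, np.nan, 3.4, 15.0, np.nan],
    'TOUCH_TIME': [1.9, 0.8, -100.5, 2.7, 0.0],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing entries per column.
nan_share = toy.isnull().mean()

# Share of rows with a physically impossible (negative) touch time.
negative_share = (toy['TOUCH_TIME'] < 0).mean()

print(nan_share)
print('negative TOUCH_TIME share:', negative_share)
```

Working with shares instead of raw counts makes it easier to decide whether dropping the affected rows is acceptable.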

In [None]:
print ('columns:', DF.shape[1], '\n')
print ('DF length:', len(DF), '\n')

print ('printing columns names and sample data:\n')

for i in DF.columns:
    print (i,DF[i][0],type(DF[i][0]))

In [None]:
DF.isnull().any()

In [None]:
print (len (DF))
print (len (DF[DF.SHOT_CLOCK.isnull()==True]))

print (round(float(len (DF[DF.SHOT_CLOCK.isnull()==True]))/float(len (DF)),2)*100,'%')

In [None]:
print (len (DF[DF['TOUCH_TIME']<0]))
print (round(len (DF[DF['TOUCH_TIME']<0])/float(len (DF)),3)*100,'%')

Issues found were:

MATCHUP: holds the date and the two teams (called home team and adversary team below, although the first team listed is actually playing away when the separator is '@'); it should be 3 columns, one for each feature <br />
W: defines win or loss for the home team (as defined above); not necessary, as we have the final margin for that team ( > 0 = win, < 0 = loss) <br />
SHOT_RESULT // FGM: they represent the same thing; one is written as a string, the other is binary <br />
SHOT_CLOCK: 4% of the rows have NaN values <br />
TOUCH_TIME: is negative for 0.2% of the rows; those rows will be discarded
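A possible way to split MATCHUP into its three parts in one pass is a regular expression with named groups. This is only a sketch: the two sample strings are invented and assume a `DATE - TEAM @ TEAM` or `DATE - TEAM vs. TEAM` layout, and the column names `first_team`, `second_team`, `sep` and `first_team_at_home` are hypothetical.

```python
import pandas as pd

# Invented sample strings; the layout is an assumption about the MATCHUP column.
matchups = pd.Series(['MAR 04, 2015 - CHA @ BKN', 'MAR 03, 2015 - CHA vs. LAL'])

# One pattern captures the date, the first team, the separator and the second team.
pattern = r'^(?P<date>.+?) - (?P<first_team>\w+) (?P<sep>@|vs\.) (?P<second_team>\w+)$'
parts = matchups.str.extract(pattern)

# Convert the text date into a proper datetime column.
parts['date'] = pd.to_datetime(parts['date'], format='%b %d, %Y')

# '@' means the first team is the visitor, 'vs.' means it plays at home.
parts['first_team_at_home'] = parts['sep'] == 'vs.'

print(parts)
```

Keeping the separator as its own column preserves the home/away information that is otherwise lost when the string is split.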

### Data Pre Processing

Here we apply the fixes of the issues found above.

In [None]:
del DF['SHOT_RESULT']
del DF['W']

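# note: the strict inequality also drops rows with TOUCH_TIME equal to 0;
# .copy() lets us add columns below without a SettingWithCopyWarning
DF=DF[DF['TOUCH_TIME']>0].copy()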

In [None]:
def split_matchup(x):
    # MATCHUP is split on '-' into a date and a pair of teams,
    # and the pair is split on '@' or 'vs.'
    (date,teams)=x.split('-')
    sep='@' if '@' in teams else 'vs.'
    (first,second)=teams.split(sep)
    return date.strip(),first.strip(),second.strip()

def date_split(x):
    return split_matchup(x)[0]

def home_team_split (x):
    # first team listed; with '@' this team is actually playing away
    return split_matchup(x)[1]

def adversary_team_split (x):
    return split_matchup(x)[2]

In [None]:
DF['date']=pd.to_datetime(DF['MATCHUP'].apply(date_split))
DF['home_team']=DF['MATCHUP'].apply(home_team_split)
DF['adv_team']=DF['MATCHUP'].apply(adversary_team_split)

## First Step: Planning

Since we don't (yet) have a specific question to answer, we should visualize some of the key components in the hope of raising some hypotheses to test.

So, first, let's review the features we have at hand.

In [None]:
list(DF.columns)

A few key questions come to mind:

- Are there players that are clearly better than others at scoring %? (beyond statistical variance)

- Do players who attempt more have better % performance?

- Do the game-related features (LOCATION, SHOT_NUMBER, PERIOD, GAME_CLOCK, SHOT_CLOCK, TOUCH_TIME, SHOT_DIST, PTS_TYPE) affect the chance of an FGM? If so, how?

## Second Step: Processing and analyzing Data

To answer the first question we have to determine the FGM efficiency of each player. <br/> To estimate the uncertainty we will use the 95% binomial confidence interval computed with the Jeffreys method.
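To make the interval less of a black box: the Jeffreys interval for `made` successes out of `attempts` trials is given by the equal-tailed quantiles of a Beta(made + 1/2, attempts - made + 1/2) distribution. The helper below is a hypothetical sketch built on `scipy.stats.beta`, not the statsmodels implementation, and the counts are invented.

```python
from scipy.stats import beta

def jeffreys_interval(made, attempts, alpha=0.05):
    """Equal-tailed quantiles of Beta(made + 1/2, attempts - made + 1/2)."""
    low = beta.ppf(alpha / 2, made + 0.5, attempts - made + 0.5)
    upp = beta.ppf(1 - alpha / 2, made + 0.5, attempts - made + 0.5)
    return low, upp

# Same 45% ratio, very different sample sizes (invented numbers).
for made, attempts in [(9, 20), (450, 1000)]:
    low, upp = jeffreys_interval(made, attempts)
    print(made, attempts, round(low, 3), round(upp, 3))
```

The point of the comparison is that the interval narrows as the number of attempts grows, which is why players with few shots carry large error bars.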

In [None]:
text_opts={'fontsize':20,'fontweight':'bold'}

In [None]:
def count_shots(x):
    y=len(DF[DF['player_id']==x])
    return y

def count_shots_made(x):
    dummy_DF=DF[DF['FGM']==1]
    y=len(dummy_DF[dummy_DF['player_id']==x])
    return y

def count_games(x):
    y=(DF[DF['player_id']==x])
    z=len(y.groupby('GAME_ID'))
    return z

def max_attempts_in_game(x):
    y=DF[DF['player_id']==x]
    # rows per game = attempts per game; keep the largest value
    k=y.groupby('GAME_ID').size().max()
    return k

players=pd.DataFrame(list(set(DF['player_id'])))
players.columns=['player_id']

players['total_attempts']=players['player_id'].apply(count_shots)
players['FGM']=players['player_id'].apply(count_shots_made)
players['ratio_FGM']=players['FGM']/players['total_attempts']

players['ratio_FGM_low'],players['ratio_FGM_upp']=sm.stats.proportion_confint(players['FGM'], players['total_attempts'], method='jeffrey')

players['ratio_FGM_low']=players['ratio_FGM']-players['ratio_FGM_low']
players['ratio_FGM_upp']=players['ratio_FGM_upp']-players['ratio_FGM']

players['total_games']=players['player_id'].apply(count_games)
players['avg_attempts_per_game']=players['total_attempts']/players['total_games']
players['avg_FGM_per_game']=players['FGM']/players['total_games']
players['max_attempts_in_game']=players['player_id'].apply(max_attempts_in_game)



print (len(players), 'different players')

players=players.sort_values('ratio_FGM')

In [None]:
dummy=players.sort_values('ratio_FGM', ascending=False)

plt.figure(figsize=(20,10))

#plt.scatter(dummy.index,dummy.ratio_FGM.values)

plt.plot(dummy.ratio_FGM.values, 'ko')

plt.errorbar(np.arange(len(dummy)), dummy.ratio_FGM.values, yerr=[dummy['ratio_FGM_low'],dummy['ratio_FGM_upp']])

plt.grid()
plt.xticks([], **text_opts)
plt.yticks(**text_opts)

plt.ylim(0,1)
plt.xlim(-5,290)

plt.title('FGM % by player', **text_opts)
plt.ylabel('FGM %', **text_opts)
plt.xlabel('different players', **text_opts)

In [None]:
set (DF[DF['player_id'].isin(players[players['ratio_FGM']>.65]['player_id'])].player_name)

The plot above shows that, for the vast majority of players, the FGM % values are compatible with one another within statistical variation. <br/> The exceptions are the couple of 'best' players, which are clearly outliers.
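One way to turn this visual impression into a check is to flag the players whose interval does not contain the pooled FGM ratio. The sketch below uses invented counts and, to stay dependency-free, swaps in the normal-approximation interval instead of the Jeffreys one; the column names `low`, `upp` and `stands_out` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented per-player counts.
toy_players = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D'],
    'made': [45, 230, 70, 135],
    'attempts': [100, 500, 100, 300],
})
toy_players['ratio'] = toy_players['made'] / toy_players['attempts']

# Normal-approximation 95% interval (an assumption of this sketch,
# used here in place of the Jeffreys interval).
half_width = 1.96 * np.sqrt(
    toy_players['ratio'] * (1 - toy_players['ratio']) / toy_players['attempts']
)
toy_players['low'] = toy_players['ratio'] - half_width
toy_players['upp'] = toy_players['ratio'] + half_width

# Pooled ratio over all shots: the reference every player is compared to.
pooled = toy_players['made'].sum() / toy_players['attempts'].sum()

# A player stands out only if the pooled ratio lies outside his interval.
toy_players['stands_out'] = (pooled < toy_players['low']) | (pooled > toy_players['upp'])
print(round(pooled, 3))
print(toy_players[['player', 'stands_out']])
```

Keep in mind that with a few hundred players and 95% intervals, about 5% of them would be flagged by chance alone, so a handful of flags is not evidence of real skill differences.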

So far we have looked at the FGM %. <br/> What about absolute values? Unfortunately our data set is incomplete: we do not know how long each player was on court in each game. <br/> Such information would have been very useful for analysing efficiency as points per minute or attempts per minute.

We can, however, analyze the average attempts and FGM per game, as well as the maximum number of attempts in a game, for each player. <br/> We can also check whether attempting more (or less) is related to the player's FGM efficiency.
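These per-player quantities can also be obtained with two `groupby` steps instead of one filter per player. The sketch below runs on an invented shot log; the column names follow the ones used in this notebook, and `per_game` and `summary` are hypothetical names.

```python
import pandas as pd

# Invented shot log with the column names used in this notebook.
toy = pd.DataFrame({
    'player_id': [1, 1, 1, 1, 2, 2, 2],
    'GAME_ID':   [10, 10, 10, 11, 10, 11, 11],
    'FGM':       [1, 0, 1, 0, 1, 1, 0],
})

# Step 1: attempts and made shots per player and game.
per_game = toy.groupby(['player_id', 'GAME_ID'])['FGM'].agg(attempts='count', made='sum')

# Step 2: collapse the games into one row per player.
summary = per_game.groupby(level='player_id').agg(
    total_attempts=('attempts', 'sum'),
    FGM=('made', 'sum'),
    total_games=('attempts', 'size'),
    max_attempts_in_game=('attempts', 'max'),
)
summary['avg_attempts_per_game'] = summary['total_attempts'] / summary['total_games']
summary['ratio_FGM'] = summary['FGM'] / summary['total_attempts']
print(summary)
```

On the full data set this scans the shot log once per step, rather than once per player.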

In [None]:
players.head()

In [None]:
plt.figure(figsize=(20,10))

dummy=players.sort_values('avg_attempts_per_game', ascending=False)

# 'o' selects the marker only; the colour comes from the color argument
plt.plot(dummy['avg_attempts_per_game'].values, 'o', color='black')
plt.plot(dummy['avg_attempts_per_game'].values*dummy['ratio_FGM'].values, 'o', color='green')

plt.grid()
plt.xticks([], **text_opts)
plt.yticks(**text_opts)
plt.xlim(-5,290)

plt.title('players attempts, FGM and efficiency', **text_opts)
plt.ylabel('attempts & FGM', **text_opts)
plt.xlabel('different players', **text_opts)

plt.legend(['attempts','FGM'],markerscale=2, loc='upper left', prop={'size':24})

plt.twinx()

plt.plot(dummy['ratio_FGM'].values, 'o', color='red')
plt.ylim(0,1)
plt.yticks(color='red', **text_opts)

plt.legend(['efficiency'],markerscale=2, prop={'size':24})

Despite the high variance, the general trend is that efficiency (FGM out of attempts) increases as the number of attempts decreases. <br/> Possibly, this means that players who take fewer shots only do so on easier opportunities.

The players who score the most have, in general, the lowest efficiency. That might be because they are the ones the defense focuses on the most. <br/> It's also important to notice that this analysis isn't ideal. We are averaging per game, so players who play only a quarter or a few minutes will be impacted negatively by this.

Ideally we should have data on how long each player played for during each game; however, we don't.
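The trend read off the plot can be summarised with a rank correlation between attempts per game and efficiency. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks; the per-player values are invented, only the column names follow the notebook.

```python
import pandas as pd

# Invented per-player values; only the column names follow the notebook.
toy_players = pd.DataFrame({
    'avg_attempts_per_game': [18.2, 14.5, 11.0, 8.3, 5.1, 3.4],
    'ratio_FGM':             [0.43, 0.45, 0.44, 0.47, 0.52, 0.50],
})

# Spearman rank correlation = Pearson correlation of the ranked values.
rho = toy_players['avg_attempts_per_game'].rank().corr(toy_players['ratio_FGM'].rank())
print(round(rho, 3))
```

A clearly negative value would support the reading that efficiency drops as attempts grow; a value near zero would mean the trend is mostly noise. A rank correlation is used here because it does not assume the relation is linear.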

We have already done some player analysis. <br/> Let's now take a look at FGM vs the features. <br/> SHOT_DIST and TOUCH_TIME look like the most interesting factors at first glance.
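The attempts, made shots and ratio per feature value can be sketched with a single `groupby`, which avoids comparing floats for equality against a generated grid (for instance `np.arange(0, 1, 0.1)[3]` is `0.30000000000000004`, not `0.3`). The data below is invented, and the 0.1 resolution is an assumption taken from the grid step used in the next cell.

```python
import pandas as pd

# Invented shots; only the column names come from the data set.
toy = pd.DataFrame({
    'SHOT_DIST': [1.2, 1.2, 1.2, 7.7, 7.7, 24.3, 24.3, 24.3, 24.3],
    'FGM':       [1,   1,   0,   1,   0,   0,    1,    0,    0],
})

# Group on the distance rounded to the assumed 0.1 resolution.
by_dist = (
    toy.assign(distance=toy['SHOT_DIST'].round(1))
       .groupby('distance')['FGM']
       .agg(attempts='count', fgm='sum')
       .reset_index()
)
by_dist['ratio'] = by_dist['fgm'] / by_dist['attempts']
print(by_dist)
```

Only distances that actually occur in the data appear in the result, so there are no empty bins to filter out afterwards.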

In [None]:
mades=DF[DF['FGM']==1]
missed=DF[DF['FGM']==0]

max_touch_time=np.max(DF['TOUCH_TIME'])
max_shot_dist=np.max(DF['SHOT_DIST'])

# round the grids to one decimal so that the == comparisons below match the data
# (np.arange accumulates floating point error, e.g. 0.30000000000000004)
distance_grid=np.round(np.arange(0,max_shot_dist+0.1,0.1),1)
time_grid=np.round(np.arange(0,max_touch_time+0.1,0.1),1)

shot_distance_DF=pd.DataFrame({'distance':distance_grid})
touch_time_DF=pd.DataFrame({'time':time_grid})

def num_attempts_SHOT_DIST (x):
    
    z=DF[DF['SHOT_DIST']==x]
    k=len(z)
    
    return k

def num_attempts_TOUCH_TIME (x):
    
    z=DF[DF['TOUCH_TIME']==x]
    k=len(z)
    
    return k

def num_fgm_SHOT_DIST (x):
    
    z=mades[mades['SHOT_DIST']==x]
    k=len(z)
    
    return k

def num_fgm_TOUCH_TIME (x):
    
    z=mades[mades['TOUCH_TIME']==x]
    k=len(z)
    
    return k

shot_distance_DF['attempts']=shot_distance_DF['distance'].apply(num_attempts_SHOT_DIST)
touch_time_DF['attempts']=touch_time_DF['time'].apply(num_attempts_TOUCH_TIME)

shot_distance_DF['fgm']=shot_distance_DF['distance'].apply(num_fgm_SHOT_DIST)
touch_time_DF['fgm']=touch_time_DF['time'].apply(num_fgm_TOUCH_TIME)

# empty bins give 0/0 = NaN here; they are filled and then dropped further down
shot_distance_DF['ratio']=shot_distance_DF['fgm']/shot_distance_DF['attempts']
touch_time_DF['ratio']=touch_time_DF['fgm']/touch_time_DF['attempts']

(shot_distance_DF['ratio_uncertainty_low'],shot_distance_DF['ratio_uncertainty_upp']) = sm.stats.proportion_confint(shot_distance_DF['fgm'], shot_distance_DF['attempts'], method='jeffrey')

shot_distance_DF['ratio_uncertainty_low']=shot_distance_DF['ratio']-shot_distance_DF['ratio_uncertainty_low']
shot_distance_DF['ratio_uncertainty_upp']=shot_distance_DF['ratio_uncertainty_upp']-shot_distance_DF['ratio']

(touch_time_DF['ratio_uncertainty_low'],touch_time_DF['ratio_uncertainty_upp']) = sm.stats.proportion_confint(touch_time_DF['fgm'], touch_time_DF['attempts'], method='jeffrey')

touch_time_DF['ratio_uncertainty_low']=touch_time_DF['ratio']-touch_time_DF['ratio_uncertainty_low']
touch_time_DF['ratio_uncertainty_upp']=touch_time_DF['ratio_uncertainty_upp']-touch_time_DF['ratio']

shot_distance_DF=shot_distance_DF.fillna(0)
touch_time_DF=touch_time_DF.fillna(0)

shot_distance_DF=shot_distance_DF[shot_distance_DF['attempts']>0]
touch_time_DF=touch_time_DF[touch_time_DF['attempts']>0]

In [None]:
tolerance=0.1

DF_1=shot_distance_DF[(tolerance>=shot_distance_DF['ratio_uncertainty_low'])&(tolerance>=shot_distance_DF['ratio_uncertainty_upp'])]
DF_2=shot_distance_DF[(tolerance<shot_distance_DF['ratio_uncertainty_low'])&(tolerance<shot_distance_DF['ratio_uncertainty_upp'])]

DF_3=touch_time_DF[(tolerance>=touch_time_DF['ratio_uncertainty_low'])&(tolerance>=touch_time_DF['ratio_uncertainty_upp'])]
DF_4=touch_time_DF[(tolerance<touch_time_DF['ratio_uncertainty_low'])&(tolerance<touch_time_DF['ratio_uncertainty_upp'])]

DF_5=DF_3[DF_3['time']>1]


# DF_1=shot_distance_DF[(shot_distance_DF['ratio']*tolerance>=shot_distance_DF['ratio_uncertainty_low'])&(shot_distance_DF['ratio']*tolerance>=shot_distance_DF['ratio_uncertainty_upp'])]
# DF_2=shot_distance_DF[(shot_distance_DF['ratio']*tolerance<shot_distance_DF['ratio_uncertainty_low'])&(shot_distance_DF['ratio']*tolerance<shot_distance_DF['ratio_uncertainty_upp'])]

# DF_3=touch_time_DF[(touch_time_DF['ratio']*tolerance>=touch_time_DF['ratio_uncertainty_low'])&(touch_time_DF['ratio']*tolerance>=touch_time_DF['ratio_uncertainty_upp'])]
# DF_4=touch_time_DF[(touch_time_DF['ratio']*tolerance<touch_time_DF['ratio_uncertainty_low'])&(touch_time_DF['ratio']*tolerance<touch_time_DF['ratio_uncertainty_upp'])]

In [None]:
plt.figure(figsize=(20,10))

# labels are attached to the scatter calls so that each legend entry matches its own points
plt.scatter(DF_1.distance,DF_1.ratio, color='blue', label='uncertainties <= %s %%' %(tolerance*100))
plt.errorbar(DF_1.distance,DF_1.ratio, yerr=[DF_1['ratio_uncertainty_low'],DF_1['ratio_uncertainty_upp']], color='blue')

plt.scatter(DF_2.distance,DF_2.ratio, color='red', label='uncertainties > %s %%' %(tolerance*100))
plt.errorbar(DF_2.distance,DF_2.ratio, yerr=[DF_2['ratio_uncertainty_low'],DF_2['ratio_uncertainty_upp']], color='red')

plt.xticks(**text_opts)
plt.yticks(**text_opts)

plt.ylabel('ratio (%)', **text_opts)
plt.xlabel('SHOT_DISTANCE', **text_opts)

plt.grid()

plt.title('FGM ratio per distance', **text_opts)

plt.legend(markerscale=3, prop={'size':24})

The FGM ratio vs SHOT_DIST graph appears to show an S-shaped relation, so we can try to fit a sigmoid to it. <br/> The fit will be done only on the data that satisfies the (arbitrary) precision requirement, i.e. the blue points.
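A four-parameter sigmoid fit can be sensitive to its starting point, so it may help to pass initial guesses read off the data. The sketch below fits synthetic points generated from known, invented parameters; the guesses in `p0` are a heuristic assumed for this example, not part of the original analysis.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, x0, k, a, c):
    # x0: midpoint, k: steepness, a: height of the step, c: baseline
    return a / (1 + np.exp(-k * (x - x0))) + c

# Synthetic data from known (invented) parameters plus a little seeded noise.
true_params = (5.0, -0.8, 0.3, 0.35)
rng = np.random.default_rng(0)
x = np.linspace(0, 25, 60)
y = sigmoid(x, *true_params) + rng.normal(0, 0.005, size=x.size)

# Starting values: midpoint where y crosses half of its range,
# negative steepness because the curve falls, step height and baseline from the extremes.
mid_level = (y.max() + y.min()) / 2
p0 = [x[np.argmin(np.abs(y - mid_level))], -1.0, y.max() - y.min(), y.min()]

popt, pcov = curve_fit(sigmoid, x, y, p0=p0)
perr = np.sqrt(np.diag(pcov))  # rough one-sigma uncertainty of each parameter
print(popt)
print(perr)
```

Note that the parametrisation is not unique: flipping the signs of `k` and `a` and shifting `c` by `a` gives the same curve, so fitted parameters should be compared through the curve they produce.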

In [None]:
def sigmoid(x, x0, k, a, c):
     y = a / (1 + np.exp(-k*(x-x0))) + c
     return y

In [None]:
plt.figure(figsize=(20,10))

xdata = DF_1['distance']
ydata = DF_1['ratio']

popt, pcov = curve_fit(sigmoid, xdata, ydata)

x = np.linspace(0, 25, 50)
y = sigmoid(x, *popt)

# labels are set on the artists so the legend entries cannot be swapped
plt.scatter(xdata, ydata, s=25, label='data')
plt.errorbar(xdata,ydata,yerr=[DF_1['ratio_uncertainty_low'],DF_1['ratio_uncertainty_upp']])
plt.plot(x,y, color='green', linewidth=5, label='fit')
plt.legend(markerscale=3, prop={'size':24})
plt.grid()

plt.xticks(**text_opts)
plt.yticks(**text_opts)

plt.ylabel('ratio (%)', **text_opts)
plt.xlabel('distance', **text_opts)

plt.title('distance with fit', **text_opts)

plt.tight_layout()

#########
#########

plt.figure(figsize=(20,10))

plt.title('residues of fit', **text_opts)

plt.scatter(xdata, ydata-sigmoid(xdata, *popt), s=25, label='residue')
plt.errorbar(xdata, ydata-sigmoid(xdata, *popt),yerr=[DF_1['ratio_uncertainty_low'],DF_1['ratio_uncertainty_upp']])
# the fit itself corresponds to the zero line in a residues plot
plt.plot(xdata,np.zeros(len(xdata)), linewidth=5, color='black', label='fit')
plt.xticks(**text_opts)
plt.yticks(**text_opts)

plt.grid()

plt.ylabel('residue', **text_opts)
plt.xlabel('distance', **text_opts)

plt.legend(markerscale=3, prop={'size':24})

plt.tight_layout()

The fit is surprisingly accurate over a solid interval of distances, as one can see in the residues plot. <br/> That it does not hold everywhere makes sense, as the longest shots are most likely "hail mary" plays.

We can also observe that at close distance (<= 4) the accuracy is much higher, and that beyond a threshold it doesn't really matter how much farther away a player is.

Note: it's unclear whether the distance is in meters or in some other unit such as feet.
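The visual judgement of the residues can be backed by two numbers: the root mean square error and the reduced chi-square. The sketch below uses invented observations and model values, and assumes symmetric one-sigma uncertainties; the error bars in this notebook are 95% intervals, so they would first need to be converted (roughly divided by 1.96 under a normal approximation).

```python
import numpy as np

# Invented observations, model values and one-sigma uncertainties.
observed = np.array([0.62, 0.55, 0.47, 0.41, 0.38, 0.36])
predicted = np.array([0.60, 0.56, 0.48, 0.40, 0.37, 0.36])
sigma = np.full(observed.size, 0.02)
n_params = 4  # the sigmoid used above has four free parameters

residuals = observed - predicted

# Typical size of a residual, in the same units as the ratio.
rmse = np.sqrt(np.mean(residuals ** 2))

# Reduced chi-square: values near 1 mean the scatter matches the error bars.
chi2_red = np.sum((residuals / sigma) ** 2) / (observed.size - n_params)

print(round(rmse, 4), round(chi2_red, 2))
```

A reduced chi-square far above 1 would point to a poor model or underestimated errors, and a value far below 1 to overestimated errors.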

Let's now look at the TOUCH_TIME feature.

In [None]:
plt.figure(figsize=(20,10))

# labels are attached to the scatter calls so that each legend entry matches its own points
plt.scatter(DF_3.time,DF_3.ratio, color='blue', label='uncertainties <= %s %%' %(tolerance*100))
plt.errorbar(DF_3.time,DF_3.ratio, yerr=[DF_3['ratio_uncertainty_low'],DF_3['ratio_uncertainty_upp']], color='blue')

plt.scatter(DF_4.time,DF_4.ratio, color='red', label='uncertainties > %s %%' %(tolerance*100))
plt.errorbar(DF_4.time,DF_4.ratio, yerr=[DF_4['ratio_uncertainty_low'],DF_4['ratio_uncertainty_upp']], color='red')

plt.grid()


plt.title('FGM ratio per touch time', **text_opts)

plt.xticks(**text_opts)
plt.yticks(**text_opts)

plt.ylabel('ratio (%)', **text_opts)
plt.xlabel('touch time (s)', **text_opts)

plt.legend(markerscale=3, prop={'size':24})

At first glance it might look similar to the SHOT_DIST graph. <br/> But the high-ratio points are just a few, which might indicate they are outliers. It's quite possible that those points have a high ratio because they are all close-distance shots, and not because of the short touch time. Another variable (feature) might also be causing that behaviour. <br/> If we exclude those points, the behaviour is pretty much constant, which would make sense: a professional player shouldn't need time to make a throw.
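Whether the short touch times are simply close-range shots can be checked by splitting the shots on both features at once. The sketch below uses invented shots; the thresholds (touch time under 1 s, distance up to 4) are taken from the cuts used elsewhere in this notebook, and the `quick` and `close` columns are hypothetical.

```python
import pandas as pd

# Invented shots; only the column names come from the data set.
toy = pd.DataFrame({
    'TOUCH_TIME': [0.5, 0.7, 0.9, 0.6, 2.5, 3.0, 4.2, 0.8, 3.5, 2.2],
    'SHOT_DIST':  [1.0, 2.5, 3.0, 22.0, 2.0, 18.0, 24.0, 19.5, 3.5, 21.0],
    'FGM':        [1,   1,   0,   0,    1,   0,    1,    1,    1,   0],
})

# Two boolean features: short touch time and close range.
toy['quick'] = toy['TOUCH_TIME'] < 1
toy['close'] = toy['SHOT_DIST'] <= 4

# FGM ratio and number of attempts in each of the four combinations.
table = toy.groupby(['quick', 'close'])['FGM'].agg(ratio='mean', attempts='count')
print(table)
```

If, on the real data, the ratio depends mostly on `close` and hardly on `quick` within each distance group, the touch time effect would be explained by distance.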

In [None]:
plt.figure(figsize=(20,10))

xdata = DF_5['time']
ydata = DF_5['ratio']

#popt, pcov = curve_fit(sigmoid, xdata, ydata)
(a,b)=np.polyfit(xdata,ydata,1)

x=np.linspace(1,10,50)
y=x*a+b

# labels are set on the artists so the legend entries cannot be swapped
plt.scatter(xdata, ydata, s=25, label='data')
plt.errorbar(xdata,ydata,yerr=[DF_5['ratio_uncertainty_low'],DF_5['ratio_uncertainty_upp']])
plt.plot(x,y, color='green', linewidth=5, label='fit')
plt.legend(markerscale=3, prop={'size':24})
plt.grid()

plt.xticks(**text_opts)
plt.yticks(**text_opts)

plt.ylabel('ratio (%)', **text_opts)
plt.xlabel('touch time (s)', **text_opts)

plt.title('touch time with fit', **text_opts)

plt.tight_layout()
######
######


plt.figure(figsize=(20,10))

plt.title('residues of fit', **text_opts)

plt.scatter(xdata, ydata-(xdata*a+b), s=25, label='residue')
plt.errorbar(xdata, ydata-(xdata*a+b),yerr=[DF_5['ratio_uncertainty_low'],DF_5['ratio_uncertainty_upp']])
# the fit itself corresponds to the zero line in a residues plot
plt.plot(xdata,np.zeros(len(xdata)), linewidth=5, color='black', label='fit')
plt.xticks(**text_opts)
plt.yticks(**text_opts)

plt.ylabel('residue', **text_opts)
plt.xlabel('touch time (s)', **text_opts)

plt.legend(markerscale=3, prop={'size':24})

plt.grid()

plt.tight_layout()

## Analysis

to be started.