# NBA Shot Analysis

Analysis on what factors influence the success of a shot in an NBA game.

## Data Analysis Setup

Import the necessary packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble.partial_dependence import partial_dependence

Let's begin by reading the dataset into a pandas DataFrame and viewing some of its contents.

In [None]:
data = pd.read_csv("../input/shot_logs.csv")
data.head()

## Data Cleaning

Reading through this metadata, it seems evident that some of these data features can be modified to yield more helpful information. Let's begin by viewing some basic information about our dataset.

In [None]:
data.info()

As the output shows, the dataset has 128069 samples and 21 features. Of these 21 features, only 'SHOT_CLOCK' has missing data, and only for a small share of the samples.

The 'CLOSEST_DEFENDER_PLAYER_ID' and 'player_id' features duplicate what the corresponding name features already tell us, so we drop them. The 'GAME_ID' and 'player_name' features are most likely unnecessary as well: they yield specific information about the performance in a particular game or for a particular player, whereas our goal is to build a generalizable machine learning model. We therefore drop these features too, to reduce the risk of overfitting.

In [None]:
data.drop(["CLOSEST_DEFENDER_PLAYER_ID", "GAME_ID", "player_name", "player_id"], axis=1, inplace=True)

The 'MATCHUP' feature holds the date of a game together with its home and away teams. The year is unnecessary because all of the data comes from a single season (2014-15), and the exact day is probably too specific. The month, however, may be worth keeping as an indication of how deep into the NBA season a game took place, since some players may perform better or worse as the season progresses. Knowing which teams were involved may also be useful: different teams have different styles of play, and this most likely factors into shot success. Let's use some simple string parsing to extract this information from the feature's strings.

In [None]:
match_info = data["MATCHUP"].str.split(" - ", n=1, expand=True)
data["MONTH"] = pd.to_datetime(match_info[0]).dt.month
teams = match_info[1].str.split(" ", expand=True)
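# "A @ B" means A is the away team; otherwise A is the home team.
# After this line, teams[0] holds the home team and teams[2] the away team.
teams[0], teams[2] = np.where(teams[1] == "@", [teams[2], teams[0]], [teams[0], teams[2]])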
data["HOME_TEAM"] = teams[0]
data["AWAY_TEAM"] = teams[2]
data["TEAM"] = np.where(data["LOCATION"] == "H", data["HOME_TEAM"], data["AWAY_TEAM"])
data["OPPOSING_TEAM"] = np.where(data["LOCATION"] == "H", data["AWAY_TEAM"], data["HOME_TEAM"])
data.drop(["MATCHUP", "HOME_TEAM", "AWAY_TEAM"], axis=1, inplace=True)
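
To make the parsing above easier to follow, here is a minimal sketch that applies the same steps to two made-up 'MATCHUP' strings. The strings, the `toy` frame, and the explicit date format are assumptions for illustration; the home/away swap is written with `Series.where` instead of `np.where`, which should give the same result.

```python
import pandas as pd

# Two hypothetical MATCHUP values in the assumed "DATE - TEAM1 sep TEAM2" layout.
toy = pd.DataFrame({
    "MATCHUP": ["MAR 04, 2015 - CHA @ BKN", "DEC 12, 2014 - CHA vs. LAL"],
})

# Split once on " - " into the date part and the teams part.
info = toy["MATCHUP"].str.split(" - ", n=1, expand=True)
toy["MONTH"] = pd.to_datetime(info[0], format="%b %d, %Y").dt.month

# The teams part splits into three tokens: first team, separator, second team.
teams = info[1].str.split(" ", expand=True)
is_away = teams[1] == "@"  # "A @ B": A is playing away at B

# Keep teams[2] where the first team is away, otherwise fall back to teams[0].
toy["HOME_TEAM"] = teams[2].where(is_away, teams[0])
toy["AWAY_TEAM"] = teams[0].where(is_away, teams[2])
```
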

The 'GAME_CLOCK' feature indicates the amount of time remaining in a given quarter when a shot was taken. Let's convert this from the 'object' dtype into the total number of seconds remaining in the quarter with a new 'PERIOD_CLOCK' feature.

In [None]:
period_time = pd.to_datetime(data["GAME_CLOCK"], format="%M:%S")
data.drop("GAME_CLOCK", axis=1, inplace=True)
data["PERIOD_CLOCK"] = period_time.dt.minute*60 + period_time.dt.second
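
As a quick sanity check on the conversion, the sketch below computes the same quantity by splitting the clock string on the colon rather than parsing it as a datetime. The sample clock values are made up, and this is an alternative route to the same number, not the method used in the cell above.

```python
import pandas as pd

# Hypothetical GAME_CLOCK strings in "minutes:seconds" form.
clock = pd.Series(["11:45", "1:09", "0:07"])

# Split into minutes and seconds, convert to integers, and combine.
parts = clock.str.split(":", expand=True).astype(int)
seconds_left = parts[0] * 60 + parts[1]
```
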

Analyzing the data, it seems that the features 'SHOT_RESULT' and 'FGM' yield the same information. Let's confirm this.

In [None]:
pd.get_dummies(data["SHOT_RESULT"])["made"].astype("bool").equals(data["FGM"].astype("bool"))

Since these two features yield exactly the same information, let's drop one of them.

In [None]:
data.drop("SHOT_RESULT", axis=1, inplace=True)

The 'PTS' column indicates how many points were scored for a given shot. Since this can be seen as simply the product of the 'PTS_TYPE' and 'FGM' features, we can safely drop this feature.

In [None]:
data.drop("PTS", axis=1, inplace=True)
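
The redundancy claimed above can be checked before dropping a column. The sketch below does so on a small made-up frame; on the real data the same expression would be run against `data` before the drop.

```python
import pandas as pd

# Toy rows: a made two, a missed three, a missed two, a made three.
toy = pd.DataFrame({
    "PTS_TYPE": [2, 3, 2, 3],
    "FGM": [1, 0, 0, 1],
    "PTS": [2, 0, 0, 3],
})

# PTS is redundant if it equals PTS_TYPE * FGM on every row.
pts_is_redundant = bool((toy["PTS_TYPE"] * toy["FGM"]).eq(toy["PTS"]).all())
```
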

The 'W' column indicates whether the shooting player's team won or lost the game. However, the 'FINAL_MARGIN' feature conveys the same outcome through its sign, along with how large the margin was. Therefore, we shall drop the 'W' feature.

In [None]:
data.drop("W", axis=1, inplace=True)
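
A minimal sketch of why 'W' adds nothing beyond 'FINAL_MARGIN': the win/loss label can be rebuilt from the sign of the margin. The toy values and the "W"/"L" encoding are assumptions for illustration.

```python
import pandas as pd

# Made-up margins with the win/loss label they should imply.
toy = pd.DataFrame({
    "FINAL_MARGIN": [12, -5, 3, -20],
    "W": ["W", "L", "W", "L"],
})

# A positive margin means the shooter's team won.
w_from_margin = toy["FINAL_MARGIN"].gt(0).map({True: "W", False: "L"})
w_is_redundant = bool(w_from_margin.eq(toy["W"]).all())
```
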

## Exploratory Data Analysis

With our data cleaning procedure completed, let's view some statistical information about our data.

In [None]:
data.describe(include="all")

We can gain further insight into our dataset by visualizing its features. Let's begin by checking to see if we have a class imbalance.

In [None]:
ax = sns.countplot(x="FGM", data=data)
ax.set_xticklabels(["Missed", "Made"])
ax.set_xlabel("Shot")
ax.set_ylabel("Total")

As we can see, though more shots were missed than made in the dataset, the discrepancy isn't large enough to be of too much concern to us.

Let's now visualize how the features in our dataset are distributed, as well as their relationship with whether or not a shot is made.

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24,8))
sns.countplot(x="LOCATION", data=data, ax=axarr[0])
axarr[0].set_xticklabels(["Away", "Home"])
axarr[0].set_xlabel("Game Location")
axarr[0].set_ylabel("Total")
location = pd.crosstab(data["LOCATION"], data["FGM"]).reset_index()
location["Success_Rate"] = location[1] / (location[0] + location[1])
sns.barplot(x=location["LOCATION"], y=location["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xticklabels(["Away", "Home"])
axarr[1].set_xlabel("Game Location")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()
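
The plotting cells in this section all compute a success rate from a crosstab of a feature against 'FGM'. Because 'FGM' is 0 or 1, the same rate is simply the group mean of 'FGM'. The sketch below shows the equivalence on made-up data; the `toy` frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy shots: 2 away (1 made), 3 home (2 made).
toy = pd.DataFrame({
    "LOCATION": ["A", "A", "H", "H", "H"],
    "FGM": [1, 0, 1, 1, 0],
})

# Crosstab route: made / (missed + made) per group.
counts = pd.crosstab(toy["LOCATION"], toy["FGM"])
rate_from_crosstab = counts[1] / (counts[0] + counts[1])

# Groupby route: the mean of a 0/1 column is the share of ones.
rate_from_groupby = toy.groupby("LOCATION")["FGM"].mean()
```
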

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24,8))
sns.distplot(data["FINAL_MARGIN"], kde=False, ax=axarr[0])
axarr[0].set_xlabel("Final Score Margin")
axarr[0].set_ylabel("Total")
final_margin = pd.crosstab(data["FINAL_MARGIN"], data["FGM"]).reset_index()
final_margin["Success_Rate"] = final_margin[1] / (final_margin[0] + final_margin[1])
sns.regplot(x=final_margin["FINAL_MARGIN"], y=final_margin["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("Final Score Margin")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.countplot(x="SHOT_NUMBER", data=data, ax=axarr[0])
for label in axarr[0].xaxis.get_ticklabels()[::2]:
    label.set_visible(False)
axarr[0].set_xlabel("In-Game Shot Number")
axarr[0].set_ylabel("Total")
shot_number = pd.crosstab(data["SHOT_NUMBER"], data["FGM"]).reset_index()
shot_number["Success_Rate"] = shot_number[1] / (shot_number[0] + shot_number[1])
sns.regplot(x=shot_number["SHOT_NUMBER"], y=shot_number["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("In-Game Shot Number")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.countplot(x="PERIOD", data=data, ax=axarr[0])
axarr[0].set_xticklabels(["1", "2", "3", "4", "OT1", "OT2", "OT3"])
axarr[0].set_xlabel("Quarter")
axarr[0].set_ylabel("Total")
quarter = pd.crosstab(data["PERIOD"], data["FGM"]).reset_index()
quarter["Success_Rate"] = quarter[1] / (quarter[0] + quarter[1])
sns.barplot(x=quarter["PERIOD"], y=quarter["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("Quarter")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.distplot(data["SHOT_CLOCK"].dropna(), kde=False, ax=axarr[0])
axarr[0].set_xlabel("Time Remaining on Shot Clock (s)")
axarr[0].set_ylabel("Total")
axarr[0].set_xlim((0, 24))
shot_clock = pd.crosstab(data["SHOT_CLOCK"], data["FGM"]).reset_index()
shot_clock["Success_Rate"] = shot_clock[1] / (shot_clock[0] + shot_clock[1])
sns.regplot(x=shot_clock["SHOT_CLOCK"], y=shot_clock["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].invert_xaxis()
axarr[1].set_xlabel("Time Remaining on Shot Clock (s)")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.distplot(data["DRIBBLES"], kde=False, ax=axarr[0])
axarr[0].set_xlabel("Number of Dribbles Prior to Shot")
axarr[0].set_ylabel("Total")
axarr[0].set_xlim((0, axarr[0].get_xlim()[1]))
dribbles = pd.crosstab(data["DRIBBLES"], data["FGM"]).reset_index()
dribbles["Success_Rate"] = dribbles[1] / (dribbles[0] + dribbles[1])
sns.regplot(x=dribbles["DRIBBLES"], y=dribbles["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("Number of Dribbles Prior to Shot")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
# Negative touch times are not physically meaningful, so clip them to zero.
data["TOUCH_TIME"] = data["TOUCH_TIME"].clip(lower=0)
sns.distplot(data["TOUCH_TIME"], kde=False, ax=axarr[0])
axarr[0].set_xlabel("Ball Possession Time Prior to Shot (s)")
axarr[0].set_ylabel("Total")
axarr[0].set_xlim((0, axarr[0].get_xlim()[1]))
touch_time = pd.crosstab(data["TOUCH_TIME"], data["FGM"]).reset_index()
touch_time["Success_Rate"] = touch_time[1] / (touch_time[0] + touch_time[1])
sns.regplot(x=touch_time["TOUCH_TIME"], y=touch_time["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("Ball Possession Time Prior to Shot (s)")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.distplot(data["SHOT_DIST"], kde=False, ax=axarr[0])
axarr[0].set_xlabel("Shot Distance")
axarr[0].set_ylabel("Total")
axarr[0].set_xlim((0, axarr[0].get_xlim()[1]))
shot_distance = pd.crosstab(data["SHOT_DIST"], data["FGM"]).reset_index()
shot_distance["Success_Rate"] = shot_distance[1] / (shot_distance[0] + shot_distance[1])
sns.regplot(x=shot_distance["SHOT_DIST"], y=shot_distance["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("Shot Distance")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.countplot(x="PTS_TYPE", data=data, ax=axarr[0])
axarr[0].set_xticklabels(["Two Point", "Three Point"])
axarr[0].set_xlabel("Shot Type")
axarr[0].set_ylabel("Total")
shot_type = pd.crosstab(data["PTS_TYPE"], data["FGM"]).reset_index()
shot_type["Success_Rate"] = shot_type[1] / (shot_type[0] + shot_type[1])
sns.barplot(x=shot_type["PTS_TYPE"], y=shot_type["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("Shot Type")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.countplot(x="CLOSEST_DEFENDER", data=data, ax=axarr[0])
axarr[0].get_xaxis().set_ticks([])
axarr[0].set_xticklabels("")
axarr[0].set_xlabel("Defender")
axarr[0].set_ylabel("Total")
defender = pd.crosstab(data["CLOSEST_DEFENDER"], data["FGM"]).reset_index()
defender["Success_Rate"] = defender[1] / (defender[0] + defender[1])
defender.sort_values("Success_Rate", inplace=True)
sns.barplot(x=defender["CLOSEST_DEFENDER"],
            y=defender["Success_Rate"],
            ax=axarr[1])
axarr[1].get_xaxis().set_ticks([])
axarr[1].set_xticklabels("")
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("Defender")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.distplot(data["CLOSE_DEF_DIST"], kde=False, ax=axarr[0])
axarr[0].set_xlabel("Defender Distance from Shooter")
axarr[0].set_ylabel("Total")
axarr[0].set_xlim((0, axarr[0].get_xlim()[1]))
defender_distance = pd.crosstab(data["CLOSE_DEF_DIST"], data["FGM"]).reset_index()
defender_distance["Success_Rate"] = (defender_distance[1] / 
                                     (defender_distance[0] + 
                                      defender_distance[1]))
sns.regplot(x=defender_distance["CLOSE_DEF_DIST"],
            y=defender_distance["Success_Rate"],
            ax=axarr[1])
axarr[1].invert_xaxis()
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("Defender Distance from Shooter")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.countplot(x="MONTH", data=data, order=[10, 11, 12, 1, 2, 3], ax=axarr[0])
axarr[0].set_xticklabels(["October", 
                          "November", 
                          "December", 
                          "January", 
                          "February", 
                          "March"])
axarr[0].set_xlabel("Month of Match")
axarr[0].set_ylabel("Total")
month = pd.crosstab(data["MONTH"], data["FGM"])
month["Success_Rate"] = month[1] / (month[0] + month[1])
month = month.reindex([10, 11, 12, 1, 2, 3]).reset_index().reset_index()
sns.barplot(x=month["index"],
            y=month["Success_Rate"],
            ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xticklabels(["October", 
                          "November", 
                          "December", 
                          "January", 
                          "February", 
                          "March"])
axarr[1].set_xlabel("Month of Match")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.countplot(x="TEAM", 
              data=data, 
              order=data["TEAM"].value_counts().sort_index().index, 
              ax=axarr[0])
axarr[0].set_xticklabels(labels=axarr[0].get_xticklabels(), rotation=90)
axarr[0].set_xlabel("Team")
axarr[0].set_ylabel("Total")
team = pd.crosstab(data["TEAM"], data["FGM"]).reset_index()
team["Success_Rate"] = team[1] / (team[0] + team[1])
team.sort_values(by="Success_Rate", inplace=True)
sns.barplot(x=team["TEAM"], y=team["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xticklabels(labels=axarr[1].get_xticklabels(), rotation=90)
axarr[1].set_xlabel("Team")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.countplot(x="OPPOSING_TEAM", 
              data=data, 
              order=data["OPPOSING_TEAM"].value_counts().sort_index().index, 
              ax=axarr[0])
axarr[0].set_xticklabels(labels=axarr[0].get_xticklabels(), rotation=90)
axarr[0].set_xlabel("Opposing Team")
axarr[0].set_ylabel("Total")
opponent = pd.crosstab(data["OPPOSING_TEAM"], data["FGM"]).reset_index()
opponent["Success_Rate"] = opponent[1] / (opponent[0] + opponent[1])
opponent.sort_values(by="Success_Rate", inplace=True)
sns.barplot(x=opponent["OPPOSING_TEAM"], y=opponent["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xticklabels(labels=axarr[1].get_xticklabels(), rotation=90)
axarr[1].set_xlabel("Opposing Team")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(24, 8))
sns.distplot(data["PERIOD_CLOCK"], kde=False, ax=axarr[0])
axarr[0].set_xlim((0, axarr[0].get_xlim()[1]))
axarr[0].set_xlabel("Time Remaining in Quarter (s)")
axarr[0].set_ylabel("Total")
quarter_clock = pd.crosstab(data["PERIOD_CLOCK"], data["FGM"]).reset_index()
quarter_clock["Success_Rate"] = quarter_clock[1] / (quarter_clock[0] + quarter_clock[1])
sns.regplot(x=quarter_clock["PERIOD_CLOCK"], y=quarter_clock["Success_Rate"], ax=axarr[1])
axarr[1].set_ylim((0, 1))
axarr[1].set_xlabel("Time Remaining in Quarter (s)")
axarr[1].set_ylabel("% of Shots Made")
fig.tight_layout()

## Machine Learning

As we can see from the above data exploration, the features that seem to correlate the most with shot success are the final score margin, in-game shot number, quarter, time remaining on the shot clock, number of dribbles prior to shot, ball possession time prior to shot, shot distance, shot type, and defender distance from shooter. Though different defenders seem to have varying impacts on shot success, due to the sheer number of classes in the 'CLOSEST_DEFENDER' feature, this feature will be ignored for simplicity.

Let's use the identified relevant features to train a machine learning model that can determine whether or not a shot will go in. We will replace all missing data in the shot clock feature with the feature's mean value and partition the dataset 80% for training and 20% for testing.

In [None]:
# Assign the filled column back rather than filling in place on a column selection.
data["SHOT_CLOCK"] = data["SHOT_CLOCK"].fillna(data["SHOT_CLOCK"].mean())
ml_data = data[["FGM", 
                "FINAL_MARGIN", 
                "SHOT_NUMBER", 
                "PERIOD", 
                "SHOT_CLOCK", 
                "DRIBBLES", 
                "TOUCH_TIME", 
                "SHOT_DIST", 
                "PTS_TYPE", 
                "CLOSE_DEF_DIST"]]
X_data = ml_data.drop("FGM", axis=1)
y_data = ml_data["FGM"]
X_train, X_test, y_train, y_test = train_test_split(X_data, 
                                                    y_data, 
                                                    test_size=0.2, 
                                                    random_state=0)
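
The cell above fills missing shot-clock values with the mean of the whole dataset before splitting. A common variant, sketched below on made-up data, splits first and fills both partitions with the training mean so that no test-set information reaches the training data. This is an alternative to the approach used in this notebook, and the `toy` frame and variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the shot log; the values are made up for illustration.
toy = pd.DataFrame({
    "SHOT_CLOCK": [10.0, np.nan, 4.0, 18.0, np.nan, 12.0, 7.0, 21.0, 15.0, 3.0],
    "FGM": [1, 0, 0, 1, 1, 0, 0, 1, 1, 0],
})
X_toy = toy[["SHOT_CLOCK"]]
y_toy = toy["FGM"]

# Split first, keeping the same 80/20 proportions as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Learn the fill value from the training partition only, then apply it to both.
train_mean = X_tr["SHOT_CLOCK"].mean()
X_tr = X_tr.fillna({"SHOT_CLOCK": train_mean})
X_te = X_te.fillna({"SHOT_CLOCK": train_mean})
```
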

We will use a gradient boosting classifier as the machine learning algorithm, chosen for its strong performance across many domains and for the interpretability that its feature importances and partial dependence plots provide. We will run a simple cross-validated grid search to find a reasonable learning rate for our task.

In [None]:
cv_model = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": np.logspace(-3, -1, 3)},
).fit(X_train, y_train)
cv_model.best_params_
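
Beyond `best_params_`, it can help to see how each candidate learning rate scored. The sketch below runs the same kind of search on synthetic data and tabulates the mean cross-validated score per candidate. The synthetic dataset, the reduced `n_estimators`, and `cv=3` are assumptions chosen to keep the example fast; they are not the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the shot log.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

grid = {"learning_rate": [0.001, 0.01, 0.1]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid=grid,
    cv=3,
)
search.fit(X_toy, y_toy)

# One row per candidate: its learning rate, mean CV accuracy, and rank.
summary = pd.DataFrame(search.cv_results_)[
    ["param_learning_rate", "mean_test_score", "rank_test_score"]
]
```

On the notebook's own search, `pd.DataFrame(cv_model.cv_results_)` would give the equivalent table.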

As we can see, the gradient boosting classifier achieves its best results with a learning rate of 0.1, its default value. Let's now train our model using this learning rate and check to see how generalizable it is by viewing its accuracy score on the test set.

In [None]:
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
model.score(X_test, y_test)
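
An accuracy score is easier to judge next to a trivial baseline. Since more shots are missed than made, always predicting the more common class is a natural reference point. The helper below is a hypothetical sketch of that baseline, shown on made-up labels.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy of always predicting the more common of two 0/1 classes."""
    made_rate = np.asarray(y).mean()
    return max(made_rate, 1.0 - made_rate)

# Toy labels: three misses and two makes, so the baseline predicts "miss".
y_toy = np.array([0, 0, 0, 1, 1])
baseline = majority_baseline_accuracy(y_toy)
```

Comparing `model.score(X_test, y_test)` with `majority_baseline_accuracy(y_test)` would show how much the classifier improves on this baseline.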

In [None]:
features = X_train.columns
# Map each column to a readable label by name, so the labels stay correct
# regardless of how the importances happen to be ranked.
feature_labels = {"SHOT_DIST": "Shot Distance",
                  "CLOSE_DEF_DIST": "Defender Distance from Shooter",
                  "TOUCH_TIME": "Ball Possession Time Prior to Shot (s)",
                  "FINAL_MARGIN": "Final Margin",
                  "SHOT_CLOCK": "Time Remaining on Shot Clock (s)",
                  "DRIBBLES": "Number of Dribbles Prior to Shot",
                  "PTS_TYPE": "Shot Type",
                  "SHOT_NUMBER": "Shot Number",
                  "PERIOD": "Quarter"}
feature_significance = pd.DataFrame({"Feature": [feature_labels[f] for f in features],
                                     "Importance": model.feature_importances_})
feature_significance.sort_values("Importance",
                                 ascending=False,
                                 inplace=True)
feature_significance.reset_index(drop=True, inplace=True)
feature_significance.index += 1
ax = sns.barplot(x=feature_significance["Importance"],
                 y=feature_significance["Feature"],
                 orient="h")

As we can see, shot distance seems to be by far the most important feature, followed by the defender distance from shooter.

We can view partial dependence plots of these features to further illustrate their influence on shot success.

In [None]:
fig, axarr = plt.subplots(3, 3, figsize=(24, 16))
for feature in range(len(features)):
    # Lay the nine features out row by row on the 3x3 grid.
    axis = axarr[feature // 3][feature % 3]
    # Compute the partial dependence once per feature. This legacy function
    # returns the averaged predictions and the grid of feature values.
    pdp, grid = partial_dependence(model, [feature], X=X_train)
    sns.regplot(x=grid[0], y=pdp[0], ax=axis)
    axis.set_ylim((-1, 1))
axarr[0][0].set_xlabel("Final Margin")
axarr[0][1].set_xlabel("In-Game Shot Number")
axarr[0][2].set_xlabel("Quarter")
axarr[1][0].invert_xaxis()
axarr[1][0].set_xlabel("Time Remaining on Shot Clock (s)")
axarr[1][1].set_xlabel("Number of Dribbles Prior to Shot")
axarr[1][2].set_xlabel("Ball Possession Time Prior to Shot (s)")
axarr[2][0].set_xlabel("Shot Distance")
axarr[2][1].set_xlabel("Shot Type")
axarr[2][2].set_xlabel("Defender Distance from Shooter")
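
To give some intuition for what a partial dependence curve measures, here is a minimal brute-force sketch: force one feature to a fixed value for every sample, average the model's predictions, and repeat across a grid of values. The function name, the linear stand-in model, and the random data are all hypothetical, and this brute-force approach illustrates the idea only; it is not necessarily how scikit-learn computes the curves above.

```python
import numpy as np

def partial_dependence_curve(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite one feature for every sample
        curve.append(predict(X_mod).mean())
    return np.array(curve)

# Hypothetical stand-in model: a fixed linear score, not the fitted classifier.
weights = np.array([-0.5, 2.0])

def toy_predict(X):
    return X @ weights

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
grid = np.linspace(-1.0, 1.0, 5)
curve = partial_dependence_curve(toy_predict, X_toy, feature=0, grid=grid)
```

For this linear stand-in the curve is a straight line with slope `weights[0]`, which is what one would expect when a feature's effect does not depend on the other features.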

These plots suggest that, to increase his chances of making a shot, an NBA player should be as close to the basket and as far from any defender as possible. The number of dribbles taken before shooting, whether the shot is a two-pointer or a three-pointer, how many shots the player has already taken in the game, and the quarter in which the shot is taken seem to have no meaningful impact on shot success.