# Introduction

This notebook explores the shooting patterns of NBA players during the 2019-20 regular season. We will focus specifically on the prominent use of the 3-point shot and the role it has played in the league's shift towards "positionless" basketball in recent years. We will create an assortment of plots to illustrate how players shot the ball this season and, later on, use this information to try to group players together based on their shooting skills. Hope you enjoy this visual analysis of shooting in the NBA in 2020!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Import Data

In [None]:
# Import per-game stats
per_game_stats = pd.read_csv('/kaggle/input/nba-per-game-stats-201920/nba_2020_per_game.csv', index_col = 0)

In [None]:
per_game_stats.head()

In [None]:
# Import shooting stats
shooting_stats = pd.read_csv('/kaggle/input/nba-per-game-stats-201920/nba_2020_shooting.csv', index_col = 0)

In [None]:
shooting_stats.head()

# Cleaning the Data

## Merging the 2 Data Frames into 1

In [None]:
stats = pd.concat([per_game_stats, shooting_stats.iloc[:, 6:]], axis = 1)

In [None]:
stats.head()

In [None]:
stats.shape

## Missing Values

In [None]:
stats.isnull().sum()

In our case, it probably makes the most sense to simply impute all of the missing values with 0, since it is likely that the reason that a particular shooting percentage (or total) is missing is that the player simply didn't register enough/any attempts in that category.

In [None]:
stats.fillna(0, inplace = True)
print("Missing values in the data: {}".format(stats.isnull().sum().sum()))

There are now no missing values in the data, so we can move forward.

## Multiple Rows for the Same Player

In some cases, a player will have more than one row in the data (this is due to players switching teams mid-season and having a row for each team they've been a part of). Here, we'll replace these multiple rows with a single row containing the averages across all of the player's teams. (We could make this an average weighted by games played, but since we shouldn't really expect players' averages to change immensely from one team to another, a simple average should be good enough.)
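For reference, the weighted average mentioned above could look roughly like the sketch below. It is only an illustration: the toy data and the helper name `weighted_player_mean` are hypothetical, and it assumes the player name is the index and that `G` holds the games played with each team.

```python
import pandas as pd

# Hypothetical toy data: one player with two team stints, one with a single row
toy = pd.DataFrame(
    {'G': [10, 30, 50], 'PTS': [8.0, 12.0, 20.0]},
    index = ['Player A', 'Player A', 'Player B'],
)

def weighted_player_mean(frame, value_cols, weight_col = 'G'):
    """Average value_cols per player, weighting each row by games played."""
    weights = frame[weight_col]
    # Sum of (stat * games) per player, divided by total games per player
    weighted_sums = frame[value_cols].mul(weights, axis = 0).groupby(level = 0).sum()
    total_weights = weights.groupby(level = 0).sum()
    return weighted_sums.div(total_weights, axis = 0)

weighted = weighted_player_mean(toy, ['PTS'])
print(weighted)
```

In this toy case the 30-game stint counts three times as much as the 10-game stint, so Player A's average lands closer to 12 than a simple mean of the two rows would.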

In [None]:
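# Average only the numeric columns; recent pandas versions raise an error on text columns otherwise
df = stats.groupby(level = 0).mean(numeric_only = True).round(3)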

Looks like we lost our categorical Position and Team columns in the process (the mean can only be computed for numeric columns). Let's get those columns back below.

In [None]:
cat_features = ['Pos', 'Tm']

In [None]:
# Take each player's first listed position and team, then attach the averaged stats
df = stats.groupby(level = 0)[cat_features].first().merge(df, left_index = True, right_index = True)

In [None]:
df.head()

In [None]:
df.shape

We now have 529 rows, as opposed to the original 651.

## Clean 'Pos' Column

In [None]:
# A few of the positions are not as clean as we would like
df['Pos'].value_counts()

We want to classify each player as one of the 5 textbook positions: PG, SG, SF, PF, C. No split position markers. Let's parse out the "-PF", etc. attached to some of the Pos tags.

In [None]:
df['Pos'] = df['Pos'].apply(lambda x: x.split('-')[0])

In [None]:
df['Pos'].value_counts()

## Set Games and Minutes Played Requirement

We really only want to consider players who logged a meaningful amount of playing time this season. (These cutoffs are somewhat arbitrary; we just want to filter out players who had very limited playing time.)

In [None]:
games_req = 40
mins_req = 15

In [None]:
# Keep players who meet both the games and minutes-per-game requirements
df = df.loc[(df['G'] >= games_req) & (df['MP'] >= mins_req)]

## Overview of Cleaned Data

Now all of our stats are gathered in one final, cleaned data frame, and we can work directly with this single df from here on out. Let's quickly look over this df:

In [None]:
df.describe()

In [None]:
df.columns

In [None]:
# Output cleaned data to new csv file
df.to_csv('/kaggle/working/nba_2020_stats_cleaned.csv')

# Visualizations

In [None]:
# Set style for ensuing plots
plt.style.use(['fivethirtyeight'])
sns.set_palette('hls')

Let's now do a quick visual overview of some of this season's shooting trends.

In [None]:
# How the top scorers got their buckets, broken down by shooting range
df_top_scorers = df.sort_values('PTS', ascending = False)[:20]
shooting_ranges = ['0-3 Proportion', '3-10 Proportion', '10-16 Proportion', '16-3P Proportion', '3P Proportion']
shooting_range_labels = [x.replace(' Proportion', '') for x in shooting_ranges]

fig, axes = plt.subplots(10, 2, figsize = (16, 40))

for i, ax in enumerate(axes.flatten()):
    ax.pie(x = df_top_scorers[shooting_ranges].iloc[i], labels = shooting_range_labels, autopct="%.1f%%", radius = 1.5)
    ax.set_title(df_top_scorers.index[i] + ' - ' + df_top_scorers['Pos'].iloc[i], fontsize = 16, pad = 40)
    
fig.subplots_adjust(hspace = 0.8);

In [None]:
# How (relatively) efficient these top scorers were, specifically from 3
fig, ax = plt.subplots(figsize = (18, 7))

ax.plot('FG%', data = df_top_scorers, marker = 'o', color = 'blue')
ax.plot('3P FG%', data = df_top_scorers, marker = 's', color = 'green')
ax.axhline(y = np.mean(df['FG%']), linestyle = 'dashed', color = 'lightblue', label = 'League Average FG%')
ax.axhline(y = np.mean(df['3P FG%']), linestyle = 'dashed', color = 'lightgreen', label = 'League Average 3P FG%')

ax.set_xlabel('Player', fontsize = 14, labelpad = 20)
ax.set_ylabel('FG% and 3P FG%', fontsize = 14, labelpad = 20)
ax.set_xticks(df_top_scorers.index)
ax.set_xticklabels(df_top_scorers.index, rotation = 90)
ax.set_title("Top Scorers Shooting Efficiency", fontsize = 20, pad = 20)
ax.legend(loc = 'upper left', bbox_to_anchor = (1.05, 1), frameon = True)

plt.axis('tight');

It almost looks like there's some sort of tradeoff between 3P FG% and overall FG%; those who shot at a high percentage from 3 rarely shot the best from the field overall, and vice versa. This is probably to be expected, since better perimeter threats may see their overall FG% suffer a bit due to how many of their shots come from long distance. Just how many 3s are some of these players taking? We'll look into that next.

In [None]:
df_3P_shooters = df[df['3PA'] >= 1]    # Set an attempts requirement

fig = px.scatter(df_3P_shooters, x = '3P Proportion', y = 'FG%', hover_data = [df_3P_shooters.index], color = 'Pos')
fig.update_layout(title_text = "Tradeoff between Chucking 3s and Overall FG%", title_x = 0.5)
fig.show();

Again, there appears to be a noticeable dropoff in field goal percentage when attempting lots of 3s. That is, some players are taking so many 3s that their FG% as a whole ends up not looking all that impressive, because 3s are inherently lower-percentage shots.

Here's a similar plot using a different metric, eFG%, in place of FG% (eFG% counts each made 3-pointer as 1.5 made field goals, to account for the fact that 3 points is 1.5x as valuable as 2 points). Notice how the top scorers (the larger points) are located mostly in the upper center area of the plot. This emphasizes their versatility when it comes to scoring and shows us why they are the star players of the league - they can be effective all around the court, setting them apart from other sets of players like 3-point specialists, relatively unskilled big men, and the like. But more on different sets of players later...
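To make the eFG% adjustment concrete, here is a minimal sketch using the standard definition, eFG% = (FG + 0.5 * 3P) / FGA. The function name `effective_fg_pct` and the two shooters are hypothetical and only meant for illustration; the data frame used in this notebook already comes with an eFG% column.

```python
def effective_fg_pct(fg_made, three_made, fg_attempted):
    """eFG% = (FG + 0.5 * 3P) / FGA, so each made 3 counts as 1.5 made field goals."""
    if fg_attempted == 0:
        return 0.0
    return (fg_made + 0.5 * three_made) / fg_attempted

# Two hypothetical shooters who each score 12 points on 10 shots
inside_scorer = effective_fg_pct(fg_made = 6, three_made = 0, fg_attempted = 10)     # six 2-pointers
three_specialist = effective_fg_pct(fg_made = 4, three_made = 4, fg_attempted = 10)  # four 3-pointers

print(inside_scorer, three_specialist)
```

Plain FG% would rate the first shooter at 60% and the second at 40%, even though both produced the same number of points on the same number of shots; eFG% puts them on equal footing.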

In [None]:
fig = px.scatter(df_3P_shooters, x = '3P Proportion', y = 'eFG%', hover_data = [df_3P_shooters.index], size = 'PTS', color = 'Pos')
fig.update_layout(title_text = "Balancing Effect of eFG%", title_x = 0.5)
fig.show();

The plots above clearly show the effect the 3-point shot in today's NBA can have on a player's shooting profile; the eFG% metric has even become an almost universally known metric for defining player efficiency in response to the proliferation of 3s. Now let's take a closer look at the use (and sometimes misuse) of the 3-ball in particular.

In [None]:
fig, ax = plt.subplots(figsize = (16, 8))

df[['2PA', '3PA']].plot.kde(ax = ax)

ax.set_title("2-Point/3-Point Shots KDE", fontsize = 20, pad = 20)
ax.legend(loc = 'upper left', bbox_to_anchor = (1.05, 1), frameon = True)

plt.show();

The density curves for 2PA and 3PA are pretty close to each other here! Across the league, the distribution of 3-point attempts per game is not far from that of 2-point attempts, which suggests that for many players close to every other shot they take is a 3. However, when it comes to other shot types, particularly mid-range shots, a whole different trend exists.

In [None]:
fig, ax = plt.subplots(figsize = (16, 8))

df[['0-3 Proportion', '3-10 Proportion', '10-16 Proportion', '16-3P Proportion', '3P Proportion']].plot.kde(ax = ax)

ax.set_title("Different Shot Ranges KDE", fontsize = 20, pad = 20)
ax.legend(loc = 'upper left', bbox_to_anchor = (1.05, 1), frameon = True)

plt.show();

Now, let's focus directly on which kinds of players are using and benefitting from the 3 the most. Are they all/mostly guards, as was the norm in the league for so many years? Or is it actually a mixed bag of players, a sign of the "positionless" revolution of recent seasons? 

In [None]:
# The league's very best 3-point shooters
fig, ax = plt.subplots(figsize = (16, 8))
ax.plot('3P FG%', data = df_3P_shooters.sort_values('3P FG%', ascending = False)[:10].iloc[::-1], marker = 'o', color = 'blue')

ax.set_xlabel('Player', fontsize = 14, labelpad = 20)
ax.set_ylabel('3P%', fontsize = 14, labelpad = 20)
ax.set_xticks(df_3P_shooters.index)
ax.set_xticklabels(df_3P_shooters.index, rotation = 90)
ax.set_title("Top 3-Point Shooters by Percentage", fontsize = 20, pad = 20)


plt.axis('tight');

In [None]:
# The league's very worst 3-point shooters
fig, ax = plt.subplots(figsize = (16, 8))

ax.plot('3P%', data = df_3P_shooters.sort_values('3P%', ascending = True)[:10].iloc[::-1], marker = 'o')
ax.set_xlabel('Player', fontsize = 14, labelpad = 20)
ax.set_ylabel('3P%', fontsize = 14, labelpad = 20)
ax.set_xticks(df_3P_shooters.index)
ax.set_xticklabels(df_3P_shooters.index, rotation = 90)
ax.set_title("Bottom 3-Point Shooters by Percentage", fontsize = 20, pad = 20)

plt.axis('tight');

In [None]:
fig, ax = plt.subplots(figsize = (16, 8))

ax.plot('3P%', data = df_3P_shooters.sort_values('3PA', ascending = False)[:25], marker = 'o', color = 'green')
ax.axhline(y = np.mean(df['3P%']), linestyle = 'dashed', label = "League Average 3P%", color = 'lightgreen')

ax.set_xlabel('Player', fontsize = 14, labelpad = 20)
ax.set_ylabel('3P%', fontsize = 14, labelpad = 20)
ax.set_xticks(df_3P_shooters.index)
ax.set_xticklabels(df_3P_shooters.index, rotation = 90)
ax.set_title("Highest Volume 3-Point Shooters", fontsize = 20, pad = 20)
ax.legend(loc = 'upper left', bbox_to_anchor = (1.05, 1), frameon = True)

plt.axis('tight');

In [None]:
fig = px.scatter(df_3P_shooters, x = '3PA', y = '3P%', hover_data = [df_3P_shooters.index], size = 'FGA', color = 'Pos')
fig.update_layout(title_text = "3-Point Percentage versus 3-Point Attempts", title_x = 0.5)
fig.show();

Here, at a glance, we see a couple of noteworthy things. The league truly is shifting towards being "positionless" (at least in the case of 3-pointers). While a lot of centers still don't shoot the 3 too much, the lines are blurred in 3-point attempts and percentages across the other 4 positions - look especially at the middle of the above plot, around 2-5 3PA, where all of the positions are blended together.

We can even notice the traditional positions losing their differentiating power by looking at the several preceding plots. Quite a few of the league-leading 3-point shooters are not your traditional smaller guards: look at Robinson (SF), Bertans (PF), McDermott (PF), Millsap (PF), and so on. (Side note: conversely, many of the worst shooters are actually guards.)

In [None]:
# Average the numeric stats within each position
df_by_pos = df.groupby('Pos').mean(numeric_only = True)

fig, ax = plt.subplots(figsize = (16, 8))

# Draw on the axes created above
df_by_pos['3PA'].plot.bar(ax = ax)
ax.set_xlabel('Position', fontsize = 14, labelpad = 20)
ax.set_title("3PA by Position", fontsize = 20, pad = 20)

plt.show();

It's becoming pretty difficult to tell positions apart by looking at the shooting numbers. Imagine looking at this plot and trying to determine which bar corresponds to which position. Pretty difficult, apart from the Center bar. And isn't shooting arguably the most defining statistic?

In the book, *Basketball on Paper* (which is basketball's version of Moneyball and an excellent book), Dean Oliver identified what he called the "Four Factors of Basketball Success":

- Shooting (40%)
- Turnovers (25%)
- Rebounding (20%)
- Free Throws (15%)

The "Four Factors" were based on Oliver's extensive research of the stats behind winning teams. He claims that shooting is the most important factor, followed by turnovers, rebounding, and free throws.

(source: https://www.breakthroughbasketball.com/stats/effective-field-goal-percentage.html)

On the other hand, consider how these similar plots - using two other stats mentioned in Oliver's four factors - may be used to distinguish across different positions:

In [None]:
df_by_pos = df.groupby('Pos').mean(numeric_only = True)

fig, axes = plt.subplots(1, 2, figsize = (18, 8))

df_by_pos['TRB'].plot.bar(ax=axes[0])
axes[0].set_xlabel('Position', fontsize = 14, labelpad = 20)
axes[0].set_title("Rebounds by Position", fontsize = 20, pad = 20)

df_by_pos['TOV'].plot.bar(ax=axes[1])
axes[1].set_xlabel('Position', fontsize = 14, labelpad = 20)
axes[1].set_title("Turnovers by Position", fontsize = 20, pad = 20)


Not perfect, but definitely more informative than the 3-point chart. Someone with moderate basketball knowledge could easily determine which bars correspond to the centers and power forwards in the rebounding chart, and the turnovers chart, if nothing else, shows a clear gap between point guards and non-PGs.

In [None]:
fig, ax = plt.subplots(figsize = (16, 8))

df_by_pos[['0-3 Proportion', '3-10 Proportion', '10-16 Proportion', '16-3P Proportion', '3P Proportion']].plot.bar(stacked = True, ax = ax)

ax.set_xlabel('Position', fontsize = 14, labelpad = 20)
ax.set_title("FGA Distribution by Position", fontsize = 20, pad = 20)
ax.legend(loc = 'upper left', bbox_to_anchor = (1.05, 1), frameon = True)

plt.show();

Having all of the shooting proportions doesn't exactly help us differentiate positions either. This final plot just serves to confirm the (partial) "positionlessness" of the NBA - a very mixed profile indeed. If only we had a better way to differentiate between unique sets of players...

# Conclusion

In summary, with the exception of the center position, positions really become blurred together when it comes to shot selection. The 3 has become so ingrained into the modern game that the traditional positions don't really encapsulate the different classes of players anymore. For example, look at these players who are both classified as power forwards: 
- Davis Bertans, who shot 8.7 of his 11.3 shots/game from 3-point range
- Domantas Sabonis, who shot 1.1 of his 13.7 shots/game from 3

Or, look at these rival, star point guards:
- Damian Lillard - 10.2 of 20.4 shots/game from 3
- Russell Westbrook - 3.7 of 22.5 shots/game from 3
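To put these four players on a common scale, the short sketch below turns the per-game attempts quoted above into the share of each player's shots taken from 3. The column name `3PA Share` is introduced here purely for illustration.

```python
import pandas as pd

# Per-game attempts quoted in the text above (2019-20 regular season)
examples = pd.DataFrame(
    {
        'Pos': ['PF', 'PF', 'PG', 'PG'],
        '3PA': [8.7, 1.1, 10.2, 3.7],
        'FGA': [11.3, 13.7, 20.4, 22.5],
    },
    index = ['Davis Bertans', 'Domantas Sabonis', 'Damian Lillard', 'Russell Westbrook'],
)

# Share of each player's field goal attempts that came from 3-point range
examples['3PA Share'] = (examples['3PA'] / examples['FGA']).round(3)
print(examples)
```

Within each position the two shares are far apart: roughly three quarters of Bertans' attempts are 3s versus under a tenth for Sabonis, and half of Lillard's versus about a sixth of Westbrook's.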

So does it really make much sense to lump these sets of highly contrasting players into the same position when the way they score - and fundamentally, the way they play - is so different? To me, a position is meant to label a player's style of playing and how they contribute to the team. Think about positions in football (soccer): the positions roughly denote the location on the pitch that a player predominantly fills and controls. If we really think about it, this already exists at some level in basketball; some players are almost exclusively 3-point shooters, waiting on the perimeter for spot-up looks, other players might be purely interior lob threats, etc.

I think it would be neat to consider a roughly equivalent position system in basketball. That leads us to the next part of this project, which aims to boil down the different shooting tendencies of players into new categories, redefining the notion of positions in NBA basketball.