# NBA Data Analysis

The data used in this notebook was downloaded from [Kaggle](https://www.kaggle.com/drgilermo/nba-players-stats#Seasons_Stats.csv).  The original source of the data is [Basketball-reference](http://www.basketball-reference.com/).


## General Intro EDA

In [2]:
%load_ext nb_black


In [3]:
import pandas as pd
import numpy as np

from scipy import stats

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://docs.google.com/spreadsheets/d/1m0jaYL1KGjxW1cKJUQxVTcPOnm7v7NZEBKRZADCmc68/export?format=csv"
nba = pd.read_csv(data_url)
nba.head()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,...,0.705,,,,176.0,,,,217.0,458.0
1,1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,...,0.708,,,,109.0,,,,99.0,279.0
2,2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,...,0.698,,,,140.0,,,,192.0,438.0
3,3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,...,0.559,,,,20.0,,,,29.0,63.0
4,4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,...,0.548,,,,20.0,,,,27.0,59.0



It looks like there are a lot of nulls. Which columns are the biggest offenders?
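One way to rank columns by missingness is to take the mean of the boolean `isna()` frame. The sketch below uses a tiny made-up frame in place of `nba`; the column `empty_col` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `nba` frame (values and `empty_col` are made up).
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D"],
        "GS": [np.nan, np.nan, 10.0, 12.0],
        "empty_col": [np.nan] * 4,
        "PTS": [458.0, 279.0, 438.0, 63.0],
    }
)

# isna() gives a boolean frame; the mean of booleans is the fraction missing.
missing_frac = toy.isna().mean().sort_values(ascending=False)
print(missing_frac)
```

The same expression applied to `nba` would list the worst columns first.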

We can definitely remove the 100% missing columns.
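A minimal sketch of dropping fully empty columns, again on a made-up frame (`empty_col` is a hypothetical name). `dropna` with `how="all"` removes only the columns in which every value is missing.

```python
import numpy as np
import pandas as pd

# Toy frame: AST is partly missing, `empty_col` is 100% missing.
toy = pd.DataFrame(
    {
        "PTS": [458.0, 279.0],
        "AST": [176.0, np.nan],
        "empty_col": [np.nan, np.nan],
    }
)

# how="all" keeps any column that has at least one non-null value.
trimmed = toy.dropna(axis="columns", how="all")
print(trimmed.columns.tolist())
```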

If we dropped all missing values from the data frame, how much data would we lose?
* Number of rows?
* Percent of rows?
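The row loss can be measured by comparing the frame's length before and after `dropna()`. This is a sketch on a made-up frame, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: only the third row is complete.
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D"],
        "Pos": ["PG", None, "C", "SF"],
        "GS": [np.nan, 5.0, 10.0, np.nan],
    }
)

n_before = len(toy)
n_after = len(toy.dropna())  # drops every row with at least one null
n_lost = n_before - n_after
pct_lost = 100 * n_lost / n_before
print(f"{n_lost} rows lost ({pct_lost:.1f}%)")
```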

We might instead look at the columns we consider crucial to our analysis and drop only the rows where those are null.  Let's say that for our made-up analysis we need Year, Player, Pos, & Tm.

If we only drop rows missing values in these columns, how much data would we lose?
* Number of rows?
* Percent of rows?

In [None]:
crucial_cols = ["Year", "Player", "Pos", "Tm"]
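To estimate the impact before dropping anything, one option is a boolean mask that flags rows missing any crucial column. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

crucial_cols = ["Year", "Player", "Pos", "Tm"]

# Toy frame: GS is mostly null, but only two rows miss a crucial column.
toy = pd.DataFrame(
    {
        "Year": [1950.0, 1950.0, np.nan, 1951.0, 1951.0],
        "Player": ["A", "B", "C", None, "E"],
        "Pos": ["G", "F", "C", "F", "G"],
        "Tm": ["X", "Y", "Z", "X", "Y"],
        "GS": [np.nan, np.nan, np.nan, 3.0, np.nan],
    }
)

# True where at least one of the crucial columns is null in that row.
missing_crucial = toy[crucial_cols].isna().any(axis=1)
n_lost = int(missing_crucial.sum())
pct_lost = 100 * n_lost / len(toy)
print(f"{n_lost} rows lost ({pct_lost:.1f}%)")
```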


If the impact of dropping NAs based on these columns is low, perform the drop.
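The drop itself can be restricted to the crucial columns with the `subset` argument of `dropna`. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

crucial_cols = ["Year", "Player", "Pos", "Tm"]

# Toy frame: the second row is missing Player; GS nulls should not matter.
toy = pd.DataFrame(
    {
        "Year": [1950.0, 1950.0, 1951.0],
        "Player": ["A", None, "C"],
        "Pos": ["G", "F", "C"],
        "Tm": ["X", "Y", "Z"],
        "GS": [np.nan, np.nan, 3.0],
    }
)

# Only nulls in crucial_cols trigger a drop; reset the index afterwards.
toy_clean = toy.dropna(subset=crucial_cols).reset_index(drop=True)
print(toy_clean)
```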

Make a scatterplot of AST by PTS and color by Year.

* What do you conclude?
* What is an issue with this plot?

Make a plot showing the trend of median PTS and AST by year.

I'd advise reshaping the data into a long format (one row per year and statistic) to accomplish this.
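A sketch of one such reshape: compute the yearly medians, then `melt` them so that each row holds a year, a statistic name, and a value. The frame and the column names `stat` and `median` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for `nba` (made-up values).
toy = pd.DataFrame(
    {
        "Year": [1950, 1950, 1950, 1951, 1951, 1951],
        "PTS": [458, 279, 438, 63, 59, 590],
        "AST": [176, 109, 140, 20, 20, 233],
    }
)

# Median of each statistic per year (wide format: one column per statistic).
medians = toy.groupby("Year")[["PTS", "AST"]].median().reset_index()

# Wide -> long: one row per (Year, stat) pair.
long = medians.melt(id_vars="Year", var_name="stat", value_name="median")
print(long)
```

With the data in this shape, a call such as `px.line(long, x="Year", y="median", color="stat")` draws one line per statistic.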

In [None]:
print(nba["Pos"].unique())

In [None]:
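# Assumption: APG (assists per game) is not in the raw data; derive it as AST / G.
nba["APG"] = nba["AST"] / nba["G"]

# The ~ operator negates a boolean mask.
# So we find all positions that contain "-" and then negate this filter.
pure_pos = nba[~nba["Pos"].str.contains("-")].copy()
# Columns shown on hover (an illustrative choice).
px.box(pure_pos, x="Pos", y="APG", hover_data=["Player", "Year", "Tm"]).update_xaxes(
    categoryorder="mean descending"
)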

It looks like we're pretty safe in saying that point guards get more assists.  If we wanted to be more formal, we could run a statistical test.  The APG distributions are not exactly normal, so a non-parametric test would be a better choice than a t-test.

In [None]:
pg = pure_pos.loc[pure_pos["Pos"] == "PG", "APG"]
sg = pure_pos.loc[pure_pos["Pos"] == "SG", "APG"]

fig, axes = plt.subplots(1, 2)
stats.probplot(pg, plot=axes[0])
stats.probplot(sg, plot=axes[1])
fig.tight_layout()
plt.show()

In [None]:
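# State the alternative explicitly; older SciPy versions used a different default.
test_stat, p_value = stats.mannwhitneyu(pg, sg, alternative="two-sided")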

sg_pg_pure_pos = pure_pos[pure_pos["Pos"].isin(["PG", "SG"])]
# Pass x and y as keywords; recent seaborn versions do not accept them positionally.
sns.pointplot(x="Pos", y="APG", data=sg_pg_pure_pos, join=False, order=["PG", "SG"])
plt.title(
    f"Mann Whitney p-value < 0.05 (p = {p_value:.4f}).\n"
    "We reject the null hypothesis that there is\n"
    "no difference in median."
)
plt.show()

We can also take a step back from our fairly focused analysis (so far we've been choosing one or two variables to look at).  We could instead look at the big picture, using something like a heatmap to see which variables are correlated.

In the heatmap, I've intentionally set the color range to go from -1 to 1 so that it matches the possible values of a correlation coefficient.

Two things that stick out from this heatmap are the lack of correlation between BLK & AST and between BLK & 3P.  Everything else is somewhat positively correlated (this isn't too much of a surprise, since these are mostly offensive statistics).  It is also worth noting that PTS seems to be a little more positively correlated with TOV than with any of the other stats.

In [None]:
cols = ["3P", "PTS", "AST", "STL", "TOV", "BLK"]

plt.figure(figsize=(10, 8))
sns.heatmap(nba[cols].corr(), annot=True, vmin=-1, vmax=1)
plt.tight_layout()
plt.show()

We could also do some off-the-court analysis.  For example, what are the most common first and last names in the NBA?

In [None]:
nba["first_name"] = nba["Player"].str.split(" ").str[0]
nba["last_name"] = nba["Player"].str.split(" ").str[-1]

# Name the columns explicitly so the result does not depend on the pandas version.
top_10_first_names = nba["first_name"].value_counts().iloc[:10]
top_10_first_names = top_10_first_names.rename_axis("name").reset_index(name="count")
sns.barplot(x="count", y="name", data=top_10_first_names)
plt.tight_layout()
plt.show()

top_10_last_names = nba["last_name"].value_counts().iloc[:10]
top_10_last_names = top_10_last_names.rename_axis("name").reset_index(name="count")
sns.barplot(x="count", y="name", data=top_10_last_names)
plt.tight_layout()
plt.show()

Next, we'll make up an example to demonstrate a contingency table.  The NBA data doesn't lend itself well to this, since almost all of its columns are numeric.

Our made-up question: does handedness correlate with position?  We'll generate random handedness data for the example.

In [None]:
pure_pos = nba[~nba["Pos"].str.contains("-")].copy()
pure_pos["handedness"] = np.random.choice(["L", "R"], pure_pos.shape[0])

crosstab = pd.crosstab(pure_pos["handedness"], pure_pos["Pos"])
crosstab

Nothing really stands out, but we could run a $\chi^2$ (chi-squared) test to be more formal.  The null hypothesis of a $\chi^2$ test of independence is that there is no relationship between the variables.  Below we see that our p-value is not below our 5% threshold, so we fail to reject the null: we have no evidence of a relationship between position and handedness.

In [None]:
chi2, p_value, _, _ = stats.chi2_contingency(crosstab)
print(f"p-value < 0.05? {p_value < 0.05} (p={p_value:.2f})")
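To see what the test statistic measures, the $\chi^2$ value can be computed by hand from the expected counts under independence. The table below is made up (it is not the crosstab above), and the calculation is a sketch of the uncorrected Pearson statistic.

```python
import numpy as np

# Made-up 2x3 contingency table (rows: handedness L/R, columns: three positions).
observed = np.array([[20, 30, 25], [30, 20, 25]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected = row_totals * col_totals / n

# Pearson chi-squared statistic and its degrees of freedom.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(chi2_stat, dof)
```

The p-value is then the upper tail of the $\chi^2$ distribution, for example `stats.chi2.sf(chi2_stat, dof)`. For tables larger than 2x2, `stats.chi2_contingency` should report the same statistic, since its continuity correction only applies when there is one degree of freedom.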

Since we're making up the data anyway, we could manufacture some differences.

In [None]:
pure_pos = nba[~nba["Pos"].str.contains("-")].copy()
pure_pos["handedness"] = np.random.choice(["L", "R"], pure_pos.shape[0])

pure_pos.loc[pure_pos["Pos"] == "PG", "handedness"] = "R"
pure_pos.loc[pure_pos["Pos"] == "SG", "handedness"] = "L"

crosstab = pd.crosstab(pure_pos["handedness"], pure_pos["Pos"])
crosstab

In [None]:
chi2, p_value, _, _ = stats.chi2_contingency(crosstab)
print(f"p-value < 0.05? {p_value < 0.05} (p={p_value:.2f})")

OMG! We have a significant result now! We conclude that there is a relationship between handedness and position.