# NBA Data Analysis

The data used in this notebook was downloaded from [Kaggle](https://www.kaggle.com/drgilermo/nba-players-stats#Seasons_Stats.csv).  The original source of the data is [Basketball-reference](http://www.basketball-reference.com/).


## General Intro EDA

In [1]:
%load_ext nb_black


In [2]:
import pandas as pd
import numpy as np

from scipy import stats
import statsmodels.api as sm

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://docs.google.com/spreadsheets/d/1m0jaYL1KGjxW1cKJUQxVTcPOnm7v7NZEBKRZADCmc68/export?format=csv"
nba = pd.read_csv(data_url)
nba.head()

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,...,0.705,,,,176.0,,,,217.0,458.0
1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,0.435,...,0.708,,,,109.0,,,,99.0,279.0
2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,...,0.698,,,,140.0,,,,192.0,438.0
3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,0.312,...,0.559,,,,20.0,,,,29.0,63.0
4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,0.308,...,0.548,,,,20.0,,,,27.0,59.0



Looks like a lot of nulls. Which columns are the biggest offenders?

We can definitely remove the 100% missing columns.
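One way to answer this, as a minimal sketch: take the mean of `isna()` per column to get the fraction missing, then drop any column that is entirely empty. The `toy` frame and the `missing_share` helper are made up for illustration; in the notebook you would pass `nba` instead.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy stand-in for the `nba` frame (made-up values, not the real data)
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D"],
        "PTS": [10.0, np.nan, 7.0, 3.0],
        "blank": [np.nan] * 4,  # a 100% missing column
    }
)

share = missing_share(toy)
print(share)

# Drop columns in which every value is missing
toy_trimmed = toy.dropna(axis="columns", how="all")
print(toy_trimmed.columns.tolist())
```

With the real data this would look like `missing_share(nba)` followed by `nba = nba.dropna(axis="columns", how="all")`.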

If we dropped all missing values from the data frame, how much data would we lose?
* Number of rows?
* Percent of rows?
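A rough sketch for measuring that loss: compare the row count before and after `dropna()`. The `rows_lost` helper and the `toy` frame are hypothetical names used only for this example.

```python
import numpy as np
import pandas as pd


def rows_lost(df):
    """Return (count, percent) of rows that df.dropna() would remove."""
    n_lost = len(df) - len(df.dropna())
    return n_lost, 100 * n_lost / len(df)


# Toy stand-in for the `nba` frame (made-up values)
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D"],
        "PTS": [10.0, np.nan, 7.0, 3.0],
        "AST": [1.0, 2.0, np.nan, 4.0],
    }
)

n_lost, pct_lost = rows_lost(toy)
print(f"Dropping all NAs removes {n_lost} rows ({pct_lost:.1f}%)")
```

Note that if a 100% missing column is still in the frame, every row contains at least one null and this reports a total loss, so it makes sense to remove those columns first.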

We might instead pick the columns we consider crucial to our analysis and drop only the rows where those are null. Let's say that for our made-up analysis we need Year, Player, Pos, & Tm.

If we only drop rows missing values in these columns, how much data would we lose?
* Number of rows?
* Percent of rows?

In [None]:
crucial_cols = ["Year", "Player", "Pos", "Tm"]


If the impact of dropping NAs based on these columns is low, perform the drop:
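A possible approach, sketched on a made-up `toy` frame rather than the real data: build a boolean mask of rows missing any crucial value, use it to report the impact, then drop with `dropna(subset=...)`.

```python
import numpy as np
import pandas as pd

crucial_cols = ["Year", "Player", "Pos", "Tm"]

# Toy stand-in for the `nba` frame (made-up values)
toy = pd.DataFrame(
    {
        "Year": [1950.0, 1950.0, np.nan, 1951.0],
        "Player": ["A", "B", np.nan, "D"],
        "Pos": ["PG", "SG", np.nan, "C"],
        "Tm": ["FTW", "INO", np.nan, "CHS"],
        "PTS": [458.0, np.nan, np.nan, 63.0],
    }
)

# True for rows missing at least one crucial value
missing_crucial = toy[crucial_cols].isna().any(axis=1)
n_lost = int(missing_crucial.sum())
pct_lost = 100 * missing_crucial.mean()
print(f"{n_lost} rows ({pct_lost:.1f}%) are missing a crucial value")

# Drop only those rows; nulls in non-crucial columns such as PTS are kept
toy_clean = toy.dropna(subset=crucial_cols)
```

On the real frame the last step would be `nba = nba.dropna(subset=crucial_cols)`.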

Make a scatterplot of AST by PTS and color by Year.

* What do you conclude?
* What is an issue with this plot?
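One way to draw this plot with seaborn, shown on randomly generated stand-in data (the `toy` frame and its values are invented, so the pattern it shows says nothing about the real league). The `alpha` setting is our own choice, not something the exercise asks for.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy stand-in for the `nba` frame (random, made-up values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {"Year": rng.integers(1950, 2018, n), "PTS": rng.poisson(500, n)}
)
toy["AST"] = (0.2 * toy["PTS"] + rng.normal(0, 20, n)).clip(lower=0)

fig, ax = plt.subplots(figsize=(8, 6))
# A numeric hue column is drawn with a continuous color scale
sns.scatterplot(x="PTS", y="AST", hue="Year", data=toy, alpha=0.4, ax=ax)
ax.set_title("AST by PTS, colored by Year (toy data)")
```

Swapping `toy` for `nba` and calling `plt.show()` should give the notebook version.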

Make a plot showing the trend of median PTS and AST by year.

I'd advise reshaping the data to long format to accomplish this.
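A sketch of one possible reshape: compute the yearly medians with `groupby`, then `melt` them into long format so each row is one (Year, stat, median) triple. The helper name `median_by_year_long` and the `toy` frame are hypothetical.

```python
import pandas as pd


def median_by_year_long(df, stat_cols=("PTS", "AST")):
    """Yearly medians of stat_cols, melted to long format."""
    medians = df.groupby("Year")[list(stat_cols)].median().reset_index()
    return medians.melt(id_vars="Year", var_name="stat", value_name="median")


# Toy stand-in for the `nba` frame (made-up values)
toy = pd.DataFrame(
    {
        "Year": [1950, 1950, 1950, 1951, 1951],
        "PTS": [10, 20, 30, 5, 15],
        "AST": [1, 2, 3, 4, 6],
    }
)

long_medians = median_by_year_long(toy)
print(long_medians)
```

The long frame plugs directly into `sns.lineplot(x="Year", y="median", hue="stat", data=long_medians)`, which draws one line per stat.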

* List the unique positions.  Which occur the most? the least?
* Filter the dataframe to 'pure' positions (i.e. only keep those like `'PG'` and drop the combined ones like `'PG-SG'`).
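Both bullets can be approached with a couple of pandas calls. This is a sketch on a made-up `toy` frame; passing `na=True` to `str.contains` is our assumption about how to treat missing positions (it excludes them along with the hyphenated ones).

```python
import pandas as pd

# Toy stand-in for the `nba` frame (made-up values)
toy = pd.DataFrame(
    {"Pos": ["PG", "PG", "SG", "PG-SG", "C", "C", "C", None]}
)

# Counts per position, most frequent first (missing values are ignored)
pos_counts = toy["Pos"].value_counts()
print(pos_counts)

# Keep rows whose position has no hyphen; na=True treats a missing
# position like a hyphenated one, so those rows are dropped too
is_pure = ~toy["Pos"].str.contains("-", na=True)
pure = toy[is_pure].copy()
print(sorted(pure["Pos"].unique()))
```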

Let's say we want to know if the number of assists changes based on position.  Create a plot to help us begin investigating this.
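A boxplot per position is one reasonable starting point (a violin or strip plot would also work). The sketch below uses invented data, so the gaps between groups are manufactured rather than observed.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy stand-in for the filtered `nba` frame (random, made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "Pos": np.repeat(["PG", "SG", "C"], 40),
        "AST": np.concatenate(
            [rng.poisson(300, 40), rng.poisson(150, 40), rng.poisson(80, 40)]
        ),
    }
)

fig, ax = plt.subplots(figsize=(8, 5))
# One box per position showing the spread of assists
sns.boxplot(x="Pos", y="AST", data=toy, ax=ax)
ax.set_title("Assists by position (toy data)")
```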

It looks like we're pretty safe in saying the point guards get more assists.  Perform a statistical test to confirm this difference.  What test might we use and what do we conclude?
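One candidate, not necessarily the intended answer, is the Kruskal-Wallis test (`scipy.stats.kruskal`), a rank-based comparison of several groups that does not assume normally distributed assists. A one-way ANOVA (`scipy.stats.f_oneway`) is the parametric alternative. The data below is invented so that the groups differ by construction.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy stand-in for the filtered `nba` frame (random, made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "Pos": np.repeat(["PG", "SG", "C"], 50),
        "AST": np.concatenate(
            [rng.poisson(300, 50), rng.poisson(150, 50), rng.poisson(80, 50)]
        ),
    }
)

# One array of assists per position
groups = [g["AST"].to_numpy() for _, g in toy.groupby("Pos")]

# Null hypothesis: all positions share the same assist distribution
h_stat, p_value = stats.kruskal(*groups)
print(f"H = {h_stat:.1f}, p = {p_value:.3g}")
```

A small p-value only says that at least one position differs. To back the specific claim about point guards, a follow-up pairwise comparison (for example `stats.mannwhitneyu` on PG versus another position) would be needed.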

We can also take a step back from our pretty focused analysis (i.e. we've been choosing 1 or 2 variables to look at).  We could instead look at the big picture using something like a heatmap to see what variables are correlated.

In the heatmap, I've intentionally set the range of colors to go from -1 to 1 to map well to potential values of correlation coefficients.

Some things that stick out from this heatmap are the lack of correlation between BLK & AST and between BLK & 3P.  Everything else is somewhat positively correlated (this isn't too much of a surprise since they're mostly offensive statistics).  Notably, PTS seems to be a little more positively correlated with TOV than with any of the other stats.

In [None]:
cols = ["3P", "PTS", "AST", "STL", "TOV", "BLK"]

# Use figure to make it bigger
plt.figure(figsize=(10, 8))
sns.heatmap(____, annot=True, vmin=-1, vmax=1)
plt.tight_layout()
plt.show()
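One plausible way to fill the blank is the pairwise correlation matrix from `DataFrame.corr()`. The sketch below runs on random stand-in data, so its correlations will hover near zero and will not match the pattern described for the real stats.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

cols = ["3P", "PTS", "AST", "STL", "TOV", "BLK"]

# Toy stand-in for the `nba` frame (random, made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, len(cols))), columns=cols)

# Pearson correlation between every pair of columns
corr = toy[cols].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
```

In the notebook cell this would correspond to passing `nba[cols].corr()` as the first argument to `sns.heatmap`.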

We could also look at some off the court analysis.  For example, what's the most popular name in the NBA?

In [None]:
# Split into first and last name
nba["first_name"] = nba["Player"].str.split(" ").str[0]
nba["last_name"] = nba["Player"].str.split(" ").str[-1]

In [None]:
# Count the occurrences of names
first_name_counts = nba["first_name"]._____
last_name_counts = nba["last_name"]._____

In [None]:
# Take top n
top_first_names = first_name_counts.____
top_last_names = last_name_counts.____

In [None]:
# Create bar plots
sns.____(x="____", y="____", data=____)
plt.show()

sns.____(x="____", y="____", data=____)
plt.show()

Plot bonus:

* Pretty up the above barplots.  Give them nice axes labels & titles.  Show these plots horizontally side by side.
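A sketch covering the name counts and the side-by-side layout, using `value_counts`, `head`, and `sns.barplot` on two axes from `plt.subplots(1, 2)`. The player names, the `top_counts` helper, and the axis labels are all made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def top_counts(names, n=10):
    """Top-n most frequent values as a tidy frame with name/count columns."""
    return names.value_counts().head(n).rename_axis("name").reset_index(name="count")


# Toy stand-in for the `nba` frame (placeholder names, not real players)
toy = pd.DataFrame(
    {"Player": ["Bob Alpha", "Bob Beta", "Bob Gamma", "Ed Alpha", "Ed Beta", "Al Alpha"]}
)
toy["first_name"] = toy["Player"].str.split(" ").str[0]
toy["last_name"] = toy["Player"].str.split(" ").str[-1]

top_first = top_counts(toy["first_name"], n=3)
top_last = top_counts(toy["last_name"], n=3)

# Two horizontal bar charts, side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.barplot(x="count", y="name", data=top_first, ax=axes[0])
axes[0].set(title="Most common first names", xlabel="Rows", ylabel="First name")
sns.barplot(x="count", y="name", data=top_last, ax=axes[1])
axes[1].set(title="Most common last names", xlabel="Rows", ylabel="Last name")
fig.tight_layout()
```

Because the data has one row per player per season, these counts reward long careers. Deduplicating on `Player` first (for example with `drop_duplicates("Player")`) would count players instead of rows.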


----

Going to make up some stuff to get a contingency table example.  This NBA data doesn't lend itself too well since we pretty much only have numeric data.

Made-up example: does handedness correlate with position?  We'll make up the handedness data.

In [None]:
# Removing hyphenated positions (na=True also excludes rows with a missing Pos)
pure_pos = nba[~nba["Pos"].str.contains("-", na=True)].copy()

# Randomly assign handedness as left/right
pure_pos["handedness"] = np.random.choice(["L", "R"], pure_pos.shape[0])

crosstab = pd.crosstab(pure_pos["handedness"], pure_pos["Pos"])
crosstab

Nothing really stands out.  We could run a $\chi^2$ (chi-square) test to be more formal.  The null hypothesis of a $\chi^2$ test is that there is no relationship between the variables.  Below we see that our p-value is not below our 5% threshold (as expected for randomly assigned handedness, although an unseeded run will occasionally dip below it by chance), so we fail to reject the null and conclude that we don't see a relationship between position and handedness.

In [None]:
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
p

Since we're making up the data anyway, we could manufacture some differences.

In [None]:
# Removing hyphenated positions (na=True also excludes rows with a missing Pos)
pure_pos = nba[~nba["Pos"].str.contains("-", na=True)].copy()

# Randomly assign handedness as left/right
pure_pos["handedness"] = np.random.choice(["L", "R"], pure_pos.shape[0])

# Hard code all pg to be righties
# Hard code all sg to be lefties
pure_pos.loc[pure_pos["Pos"] == "PG", "handedness"] = "R"
pure_pos.loc[pure_pos["Pos"] == "SG", "handedness"] = "L"

Repeat the $\chi^2$ analysis with this new data.
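A self-contained sketch of that repeat, using a randomly generated `toy` frame in place of `pure_pos` (positions and handedness are both invented here). It applies the same hard-coding and then reruns `stats.chi2_contingency`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy stand-in for `pure_pos` (random, made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Pos": rng.choice(["PG", "SG", "SF", "PF", "C"], size=500)})
toy["handedness"] = rng.choice(["L", "R"], size=len(toy))

# Manufacture a relationship: all PG are righties, all SG are lefties
toy.loc[toy["Pos"] == "PG", "handedness"] = "R"
toy.loc[toy["Pos"] == "SG", "handedness"] = "L"

crosstab = pd.crosstab(toy["handedness"], toy["Pos"])
print(crosstab)

# Null hypothesis: handedness and position are independent
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")
```

With the relationship hard-coded, the p-value should fall far below the 5% threshold, so this time we would reject the null.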