In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

plt.rcParams['xtick.labelsize'] = '16'
plt.rcParams['ytick.labelsize'] = '16'
plt.rcParams['axes.labelsize'] = '18'
plt.rcParams['axes.titlesize'] = '18'

nba_team_stats = pd.read_csv('2023-2024_nba_team_stats_per_game.csv')
nba_season_stats = pd.read_csv('2023-2024_season_stats.csv')

# Bringing it all together

Over the last few weeks, we have spent a lot of time learning how to read in datasets, make hypotheses, and test them with the data. Today, we're going to go through and review everything that we have covered, using the NBA 2023-2024 season team stats. I have two dataframes, one of which is a bunch of "per game" stats for every team, and another which is the end of the season standings. Let's take a look at these:

In [None]:
nba_team_stats

In [None]:
nba_season_stats

At the end of the day, wins are the most important stat that we care about--we want the fans to get to see their team winning, and we're going to investigate how different team stats correlate with wins. But the stats we care about are in one dataframe, and the wins are in another. In order to compare them, we need to combine the two. We can do this with a pandas function called "merge". To merge two dataframes, you give the function the two dataframes you care about and tell it which column to match rows on. In this case, we want to match on the "Team" column, so each team's per game stats end up in the same row as its season record. Let's see below how this works:

In [None]:
nba_all = pd.merge(nba_team_stats, nba_season_stats, on='Team')

nba_all
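
One thing worth checking after a merge is whether any rows were silently dropped. By default, `pd.merge` keeps only rows whose key appears in both dataframes. The sketch below uses two small made-up tables (the names `stats` and `standings` and all the numbers are hypothetical, not the NBA data) to show how an outer merge with `indicator=True` can reveal teams that failed to match, for example because of a spelling difference.

```python
import pandas as pd

# Toy stand-ins for the two tables (made-up numbers, not real NBA data)
stats = pd.DataFrame({'Team': ['A', 'B', 'C'], 'PTS': [110.0, 115.5, 108.2]})
standings = pd.DataFrame({'Team': ['C', 'A', 'D'], 'W': [30, 50, 41]})

# The default merge is an inner join: only teams present in BOTH tables survive
inner = pd.merge(stats, standings, on='Team')

# An outer join with indicator=True adds a "_merge" column saying where each row came from
outer = pd.merge(stats, standings, on='Team', how='outer', indicator=True)
unmatched = outer[outer['_merge'] != 'both']

print(inner)
print(unmatched[['Team', '_merge']])
```

If `unmatched` comes back empty for the real dataframes, every team was matched and nothing was lost in the merge.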

In [None]:
nba_all.columns.values

Now, we have a dataframe that contains a ton of info about all the team stats per game. Let's ask the question: which stat is the most correlated with winning, and which is the least? To do this, let's use the scipy "linregress" function. First, before we systematically test this for everything, do we have a guess for what the answer is? Let's test that below to remind ourselves how to measure the correlation coefficient and the slope.

In [None]:
from scipy.stats import linregress

column_to_test = 'PTS'

fit = linregress(nba_all[column_to_test], nba_all['W'])

fit

And now, let's plot this, along with the best fitting line.

In [None]:
slope = fit.slope
intercept = fit.intercept

x_arr = np.linspace(np.min(nba_all[column_to_test]), np.max(nba_all[column_to_test]), 100)
wins_arr = slope * x_arr + intercept

plt.plot(nba_all[column_to_test], nba_all['W'], '.')
plt.plot(x_arr, wins_arr)
plt.xlabel(column_to_test)
plt.ylabel('Wins')

Our "fit" object also contains the Pearson r value (the linear correlation coefficient), which measures how strongly two variables are linearly related. Let's look at that below:

In [None]:
rvalue = fit.rvalue

print('Pearson r:', rvalue)
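
The `rvalue` returned by `linregress` is the Pearson correlation coefficient, which measures how well a straight line describes the data. The Spearman coefficient is a different statistic based on rank order. As a quick illustration, the sketch below compares the two on made-up data (the arrays `x` and `y` are hypothetical) where y always increases with x but not along a straight line.

```python
import numpy as np
from scipy.stats import linregress, pearsonr, spearmanr

# Made-up data: y always rises with x, but along a curve rather than a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

fit = linregress(x, y)
pearson_r, _ = pearsonr(x, y)     # linear correlation
spearman_r, _ = spearmanr(x, y)   # rank-order correlation

print('linregress rvalue:', round(fit.rvalue, 4))
print('Pearson r:        ', round(pearson_r, 4))
print('Spearman r:       ', round(spearman_r, 4))
```

The first two numbers should agree with each other, while the Spearman value reaches 1 because the ordering of the points is perfectly preserved even though the relation is curved.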

But how can we find out if this is the strongest correlation that exists in our data? We could go through and check every single column individually, but that would be really slow. We can instead employ a loop. Let's take a look at the full list of columns that we want to test. We can write a "for" loop that steps through each of those columns and does whatever we want with it. For a quick example, let's loop through and print the average value of every column we care about, along with the standard deviation.

In [None]:
columns_to_test = ['FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
                   '2P', '2PA', '2P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST',
                   'STL', 'BLK', 'TOV', 'PF', 'PTS']

for column in columns_to_test:
    print(column, np.mean(nba_all[column]), np.std(nba_all[column]))

Let's try this again but make it a bit cleaner--to do this, we can use the python "round" function to keep only two decimal places for each of these numbers.

In [None]:
for column in columns_to_test:
    print(column, round(np.mean(nba_all[column]),2), round(np.std(nba_all[column]),2))

Looping through data is one of the easiest ways to make measurements for a bunch of different variables that you care about. We can do this again, where this time, we measure the Pearson r value for every single statistic that we want to compare to wins. We can even go a step further. Let's loop through, plot every single correlation (with a best fitting line), and save the r value for every single stat. If we want to save values from a loop, a good way to do this is to "append" those values to a list every time we move through the loop. We can see how that works below as we loop through.

In [None]:
r_list = []

for column in columns_to_test:

    fit = linregress(nba_all[column], nba_all['W'])
    slope = fit.slope
    intercept = fit.intercept
    rvalue = fit.rvalue

    #append the Pearson r value to the list we've set up
    r_list.append(rvalue)
    
    x_arr = np.linspace(np.min(nba_all[column]), np.max(nba_all[column]), 100)
    wins_arr = slope * x_arr + intercept

    plt.plot(nba_all[column], nba_all['W'], '.')
    plt.plot(x_arr, wins_arr)
    plt.xlabel(column)
    plt.ylabel('Wins')
    plt.show()

#make r_list an array so it's easier to manipulate
r_list = np.array(r_list)

Now, let's take a look at what our new r_list variable looks like.

In [None]:
r_list
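
Appending to a list works well as long as we remember which position goes with which column. One alternative, sketched below under the assumption that a small made-up table (`toy`, with hypothetical values) stands in for `nba_all`, is to store each r value in a dictionary keyed by the column name, so the label and the number can never drift apart.

```python
import pandas as pd
from scipy.stats import linregress

# Made-up table standing in for nba_all (hypothetical values)
toy = pd.DataFrame({
    'W':   [20, 30, 40, 50, 60],
    'PTS': [105.0, 109.0, 112.0, 111.0, 118.0],
    'TOV': [15.0, 14.5, 13.0, 13.5, 12.0],
})

r_by_column = {}
for column in ['PTS', 'TOV']:
    # Key each r value by its column name instead of relying on list position
    r_by_column[column] = linregress(toy[column], toy['W']).rvalue

# A pandas Series keeps the labels attached and can be sorted directly
r_series = pd.Series(r_by_column).sort_values(ascending=False)
print(r_series.round(3))
```

Both approaches give the same numbers; the list version is simpler, while the dictionary version makes the results a little easier to read back.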

So r_list is an array of all of the Pearson r values that we measured. Now, we want to find out which variable has the strongest positive correlation. Let's take a look at the length of the array.

In [None]:
len(r_list)

And now let's take a look at the length of the columns that we looped through:

In [None]:
len(columns_to_test), columns_to_test

These have the same length, and they should, because we saved one value of r for every column that we tested. So every value in the r_list corresponds to the r value for the variable that is at the same index position in the columns_to_test list. So if we want to find the variable that is the most strongly correlated with wins, what we want to do is look for the maximum value in r_list. Let's see what that maximum is.

In [None]:
np.max(r_list)

Now we know that there is something in there that is strongly correlated with wins, and we want to find out what it is. To do this, we need to find the "argmax" of that variable--the index of the list that corresponds to the maximum value. We can do that with the np.argmax function.

In [None]:
np.argmax(r_list)

What that tells us is that the element at index 5 of the list corresponds to the highest value. Python starts counting at 0, so index 5 is actually the sixth entry. Because r_list and columns_to_test are in the same order, we can use that index to look up which column it is like so:

In [None]:
columns_to_test[np.argmax(r_list)]
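
To make the indexing concrete, here is a minimal sketch with made-up r values (the `labels` list and the numbers in `r_values` are hypothetical, not our measured results). It shows that `np.argmax` returns a position counted from 0, and that `np.argsort` can give the full ranking rather than just the top entry.

```python
import numpy as np

# Made-up r values and labels, in matching order (hypothetical numbers)
labels = ['FG', 'FGA', 'FG%', '3P', '3PA', '3P%']
r_values = np.array([0.41, -0.10, 0.55, 0.48, 0.12, 0.63])

# Position of the largest value, counting from 0
best_index = np.argmax(r_values)
print(best_index, labels[best_index])

# np.argsort sorts smallest-first; [::-1] flips it to largest-first
ranking = np.argsort(r_values)[::-1]
for i in ranking:
    print(labels[i], r_values[i])
```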

Interesting! So we can actually see that the stat that is the most strongly correlated with winning isn't points/game, it's actually 3 point percentage! Let's take a look again at that correlation:

In [None]:
column_max = columns_to_test[np.argmax(r_list)]

fit = linregress(nba_all[column_max], nba_all['W'])
slope = fit.slope
intercept = fit.intercept
rvalue = fit.rvalue

x_arr = np.linspace(np.min(nba_all[column_max]), np.max(nba_all[column_max]), 100)
wins_arr = slope * x_arr + intercept

plt.plot(nba_all[column_max], nba_all['W'], '.')
plt.plot(x_arr, wins_arr)
plt.xlabel(column_max)
plt.ylabel('Wins')
plt.show()

It's interesting to see that even though there's only a small range in the total percentages teams shoot (the worst teams shoot 34%, the best teams shoot 39%), the correlation is so strong. Let's take a look at the slope and interpret it. How can we interpret this number?

In [None]:
slope

The slope has units of "wins per unit of 3P%". Since the 3P% column is stored as a fraction (for example 0.37 rather than 37), one full unit would be the jump from 0% to 100%, so if we want to know how many wins shooting 1% better at the 3 point line corresponds to, we multiply our slope by 0.01. Let's check to see how much of an increase in wins 1% at the 3 point line corresponds to:

In [None]:
0.01 * slope

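So shooting 1% better from 3 point range corresponds to winning about 8 more games in a season! That's a huge increase for what seems like such a small number, but when you're shooting 30-40 three point shots every game, that 1% difference corresponds to a lot of points on average! Keep in mind that this is a correlation: it tells us that teams that shoot better from three tend to win more, not that the shooting alone causes the extra wins.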
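
The unit conversion can be checked directly. The sketch below assumes the percentage column holds fractions like 0.37 (which is what multiplying by 0.01 implies) and uses made-up shooting and win numbers, not the real NBA data. It fits the same data twice, once with fractions and once with percentage points, to show that `0.01 * slope` from the first fit matches the slope of the second.

```python
import numpy as np
from scipy.stats import linregress

# Made-up shooting fractions and win totals (hypothetical, not real NBA data)
three_pct_fraction = np.array([0.34, 0.35, 0.36, 0.37, 0.38, 0.39])
wins = np.array([22, 30, 41, 44, 52, 57])

# Slope when x is a fraction: wins per 1.00 change in 3P% (0% to 100%)
slope_fraction = linregress(three_pct_fraction, wins).slope

# Slope when x is in percentage points: wins per 1 point change in 3P%
slope_points = linregress(three_pct_fraction * 100, wins).slope

print(0.01 * slope_fraction, slope_points)
```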

# Activity: find out what the strongest *negative* correlation in the catalog is. Also, find out which statistic is the least correlated with wins.

After you find these, plot the relations.

- Hint: the opposite of "argmax" is "argmin"

- Hint: to find the least correlated variable, you want to find the value that is the closest to 0, regardless of sign. There are a few ways that you can do this, but one way is to think about the statistic we measured that is very closely related to the rvalue as well.

# Some other activities if you have time:

- Which team had the largest "point differential" (points/game - points against/game)? Is this the same team that had the most wins? Is this new stat more strongly correlated with wins than 3 point percentage?

- How do "volume" stats like the number of 3 pointers attempted correlate with the 3 point percentage? Is that correlation the same as for 2 point shots? What about the number of total shot attempts?