# Pythagorean Expectation and the NBA

The NBA is the most popular basketball league in the world. It consists of 30 teams playing an 82-game regular season, followed by playoffs to determine the champion. In terms of scale, this data looks much more like the MLB data than the IPL data we just looked at.

Basketball resembles cricket in one way: scores are much higher than in baseball. However, the points difference between winning and losing teams tends to be relatively small.

Let's see what we find this time. We follow the same procedure.

In [None]:
# Load the packages

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the data and see what it looks like

NBA = pd.read_csv('../../Data/Week 1/NBA_Games.csv')

NBA

In [None]:
# The data consists of games played between 2013 and 2019. An important difference from the baseball and cricket data
# is that here each game appears in two rows, one for each team. Each pair of rows are mirror images of each other.

# The season is identified by the column SEASON_ID: a one-digit prefix followed by the year.
# Pre-season games have the prefix "1", regular season games the prefix "2",
# and postseason games the prefix "4". We are going to look at the 2018 regular season and therefore
# want games with SEASON_ID equal to 22018.
# We can use the command ".describe()" to obtain descriptive statistics for our variables.

NBAR18 = NBA[NBA.SEASON_ID == 22018]
NBAR18.describe()
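
The SEASON_ID convention described above can be checked on a small example. This is a minimal sketch on made-up values, not on NBA_Games.csv; the column names `prefix`, `year` and `game_type` and the `labels` dictionary are hypothetical helpers introduced only for illustration.

```python
import pandas as pd

# Toy SEASON_ID values (illustrative only, not taken from NBA_Games.csv)
toy = pd.DataFrame({"SEASON_ID": [12018, 22018, 22018, 42018, 22017]})

# The first digit is the game-type prefix, the last four digits are the year
toy["prefix"] = toy["SEASON_ID"] // 10000
toy["year"] = toy["SEASON_ID"] % 10000

# Hypothetical labels, added for readability only
labels = {1: "pre-season", 2: "regular season", 4: "postseason"}
toy["game_type"] = toy["prefix"].map(labels)

# Keeping prefix 2 and year 2018 is equivalent to SEASON_ID == 22018
regular_2018 = toy[(toy["prefix"] == 2) & (toy["year"] == 2018)]
print(toy)
print(regular_2018)
```

Filtering directly on `SEASON_ID == 22018`, as in the cell above, is shorter; splitting the code into prefix and year is only useful if you want to compare several seasons or game types.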

In [None]:
# We can list all the variable names

print(NBAR18.columns.tolist())

In [None]:
# Many datasets contain missing values. Missing values in a column can distort summary statistics
# or cause some operations to fail.
# The command ".dropna()" removes every row that contains a missing value.
# Compare the counts after .dropna() below with the counts from the first .describe() above.

NBAR18 = NBAR18.dropna()
NBAR18.describe()
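
To see what `.dropna()` does to the counts, here is a minimal sketch on a made-up table (the numbers are invented and are not from the NBA data). The column names PTS and PTSAGN are borrowed from the dataset only to keep the example familiar.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column (values are made up)
toy = pd.DataFrame({
    "PTS": [110, 98, np.nan, 105],
    "PTSAGN": [102, 101, 99, np.nan],
})

# .count() reports the number of non-missing entries per column
before = toy.count()
print(before)

# .dropna() removes every row that has at least one missing value
cleaned = toy.dropna()
after = cleaned.count()
print(after)
```

Note that a row is dropped if any of its columns is missing, so the count of every column falls to the number of fully complete rows.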

In [None]:
# The game result is the column labeled 'WL'. We create a variable which has a value of '1' if the team won, and zero if it lost.
# This type of variable, where a condition (here winning) is either true (1) or not true (0) is called a "dummy variable".
# We will encounter them frequently.

NBAR18['result'] = np.where(NBAR18['WL'] == 'W', 1, 0)
NBAR18.describe()

In [None]:
# For the Pythagorean Expectation we need only the result, points scored (PTS) and points conceded (PTSAGN).
# We sum these by team. The list of columns to keep goes inside double square brackets.

NBAteams18 = NBAR18.groupby('TEAM_NAME')[['result','PTS','PTSAGN']].sum().reset_index()
NBAteams18
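
Because each game appears in two mirror-image rows, the team totals should satisfy two simple checks: total wins equal the number of games, and total points scored equal total points conceded. The sketch below illustrates this on two invented games; the team names and scores are made up.

```python
import pandas as pd

# Two toy games, each recorded once per team (values are made up)
games = pd.DataFrame({
    "TEAM_NAME": ["A", "B", "A", "C"],
    "result":    [1,   0,   0,   1],
    "PTS":       [110, 102, 95,  99],
    "PTSAGN":    [102, 110, 99,  95],
})

# Sum wins, points scored and points conceded for each team
totals = games.groupby("TEAM_NAME")[["result", "PTS", "PTSAGN"]].sum().reset_index()
print(totals)

# Each game contributes exactly one win and appears in two rows
n_games = len(games) // 2
print(totals["result"].sum() == n_games)

# Every point scored by one team is a point conceded by its opponent
print(totals["PTS"].sum() == totals["PTSAGN"].sum())
```

Running the same two checks on the real team totals is a quick way to confirm that no rows were lost or duplicated along the way.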

In [None]:
# We can now compute each team's win percentage over the 82-game season, and its Pythagorean Expectation:
# pyth = PTS squared / (PTS squared + PTSAGN squared)

NBAteams18['wpc'] = NBAteams18['result']/82
NBAteams18['pyth'] = NBAteams18['PTS']**2/(NBAteams18['PTS']**2 + NBAteams18['PTSAGN']**2)
NBAteams18
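
As a worked example of the formula used above, here is a minimal sketch for a single team. The function name `pythagorean_expectation` and the season totals are hypothetical, chosen only to show the arithmetic.

```python
def pythagorean_expectation(points_for, points_against):
    """Pythagorean Expectation with exponent 2, as in the cell above."""
    pf_sq = points_for ** 2
    pa_sq = points_against ** 2
    return pf_sq / (pf_sq + pa_sq)

# Made-up season totals for one team
pyth_value = pythagorean_expectation(9000, 8800)
print(round(pyth_value, 3))

# A team that scores exactly as many points as it concedes is expected to win half its games
print(pythagorean_expectation(100, 100))
```

Even a small edge in total points moves the expectation above 0.5, which is why the values for a whole league cluster fairly tightly around one half.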

In [None]:
# We now plot the data. Our results look very similar to the MLB case.

sns.relplot(x="pyth", y="wpc", data = NBAteams18)

## Self test

Run sns.relplot again, but this time write y="result" instead of y="wpc". What do you find? Does it make a difference?

In [None]:
# Finally we run the regression: wpc = Intercept + coef x pyth
# The coefficient of the variable pyth is strongly significant, and the R-Squared of the regression is close to 100%.

pyth_lm = smf.ols(formula = 'wpc ~ pyth', data=NBAteams18).fit()
pyth_lm.summary()
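
To make the R-squared in the summary more concrete, the sketch below fits the same kind of straight line by hand and computes R-squared from its definition. It uses `numpy.polyfit` rather than statsmodels, and the data are synthetic (randomly generated), so the numbers it prints say nothing about the NBA; the variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data: 30 "teams" whose win percentage tracks pyth plus noise
rng = np.random.default_rng(0)
pyth = rng.uniform(0.3, 0.7, size=30)
wpc = pyth + rng.normal(0.0, 0.03, size=30)

# Least-squares fit of wpc = intercept + slope * pyth
slope, intercept = np.polyfit(pyth, wpc, 1)
fitted = intercept + slope * pyth

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((wpc - fitted) ** 2)
ss_tot = np.sum((wpc - wpc.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

For a regression with one explanatory variable and an intercept, this R-squared equals the squared correlation between the two variables, which is one way to sanity-check the value reported by `pyth_lm.summary()`.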

## Self test

Run the regression above again, but write 'wpc ~ result' instead of 'wpc ~ pyth' in the line starting pyth_lm. What difference does this make?

# Conclusion

We have found that the Pythagorean model fits the NBA data in roughly the same way as it fits the MLB data. Let's now look at a fourth example: English Premier League soccer.
