In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Motivation
Since the Golden State Warriors, led by superstar Stephen Curry, won the NBA championship in 2015, 3-pt shooting has increased across the league. We now know that a 3-pt shooting team can win a championship, but how important is a high team 3-pt field goal percentage for winning individual games?


## Experiment Setup

Because I want to investigate the effect of 3pt shooting after GSW won their championship, I will only look at data from the 2015 season onward.

In [None]:
df = pd.read_csv('../input/nba-games/games.csv')
# keep only seasons after 2014, i.e. 2015 onward
df = df[df.SEASON > 2014]
df.head()

Creating the treatment and control groups:

- **Treatment** group: games in which the winning team had a higher 3pt field goal percentage (3pt-fg) than its opponent
- **Control** group: games in which the winning team had a lower or equal 3pt field goal percentage (3pt-fg) than its opponent

In [None]:
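# keep only the columns needed for the comparison (copy to avoid SettingWithCopyWarning)
df = df[['FG3_PCT_home','FG3_PCT_away','HOME_TEAM_WINS']].copy()
# 1 if the home team had the strictly higher 3pt-fg, else 0 (ties count as 0)
df['HOME_3PT'] = (df.FG3_PCT_home > df.FG3_PCT_away).astype(int)
df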
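
To make the grouping rule concrete, here is a minimal sketch on a few made-up rows. The values are hypothetical and not taken from `games.csv`, and the names `toy`, `winner_matches_flag`, `treatment_toy` and `control_toy` exist only for this illustration. It assumes both columns are 0/1, in which case "winner had the higher 3pt-fg" reduces to the two flags being equal. Note how tied games are split by who won, because a tie sets `HOME_3PT` to 0.

```python
import pandas as pd

# Hypothetical toy rows, not taken from games.csv
toy = pd.DataFrame({
    "FG3_PCT_home": [0.40, 0.30, 0.35, 0.35],
    "FG3_PCT_away": [0.35, 0.38, 0.35, 0.35],
    "HOME_TEAM_WINS": [1, 1, 0, 1],
})

# 1 when the home team shot strictly better from three, 0 otherwise (ties give 0)
toy["HOME_3PT"] = (toy["FG3_PCT_home"] > toy["FG3_PCT_away"]).astype(int)

# Treatment rule: home won with flag 1, or away won with flag 0.
# For 0/1 columns that is the same as the two columns being equal.
winner_matches_flag = toy["HOME_TEAM_WINS"] == toy["HOME_3PT"]
treatment_toy = toy[winner_matches_flag]
control_toy = toy[~winner_matches_flag]

print(treatment_toy.index.tolist())  # rows 0 and 2 (row 2 is a tie won by the away team)
print(control_toy.index.tolist())    # rows 1 and 3 (row 3 is a tie won by the home team)
```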

Treatment:
- There are 4826 games in the treatment group
- `p_obs_treatment` is 56.75%. As computed below, it is the share of treatment-group games won by the home team

In [None]:
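# Treatment: the winner matches the HOME_3PT flag
# (home won with the higher 3pt-fg, or away won while home was not higher; ties have HOME_3PT == 0)
treatment = df.query("(HOME_TEAM_WINS == 1 & HOME_3PT == 1) | (HOME_TEAM_WINS == 0 & HOME_3PT == 0)")
print(treatment.shape[0])  # number of games in the group
# share of treatment-group games won by the home team
n_treatment = (treatment['HOME_TEAM_WINS'] == 1).sum()
p_obs_treatment = n_treatment/treatment.shape[0]
p_obs_treatment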

Control:
- There are 2045 games in the control group
- `p_obs_control` is 59.8%. As computed below, it is the share of control-group games won by the home team

In [None]:
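# Control: the winner does not match the HOME_3PT flag
# (away won although home had the higher 3pt-fg, or home won with a lower or equal 3pt-fg)
control = df.query("(HOME_TEAM_WINS == 0 & HOME_3PT == 1) | (HOME_TEAM_WINS == 1 & HOME_3PT == 0)")
print(control.shape[0])  # number of games in the group
# share of control-group games won by the home team
n_control = (control['HOME_TEAM_WINS'] == 1).sum()
p_obs_control = n_control/control.shape[0]
p_obs_control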
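
The two group sizes quoted above also allow a quick side calculation: the share of all games in which the winner had the higher 3pt-fg. This is only a sketch. It assumes the two groups together cover every game in the filtered data, and the treatment count includes tied games won by the away team. The variable names are made up for illustration.

```python
# Group sizes as quoted in the text above (not recomputed from games.csv)
n_treatment_games = 4826  # winner had the higher 3pt-fg (plus ties won by the away team)
n_control_games = 2045    # winner had a lower or equal 3pt-fg

total_games = n_treatment_games + n_control_games
share_higher_3pt_won = n_treatment_games / total_games

print(total_games)
print(f"{share_higher_3pt_won:.4f}")
```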

## A/B Test
Suppose we want to decide, based only on the observed data and at a Type I error rate of 5%, whether the team with the higher 3pt-fg percentage wins more games. What would the null and alternative hypotheses be?

### Hypothesis Test
- $H_0$: Higher team 3pt-fg percentage does not win more games. `P_higher_3pt_win - P_not_higher_win = 0`
- $H_1$: Higher team 3pt-fg percentage wins more games. `P_higher_3pt_win - P_not_higher_win > 0`

For this A/B test, we assume $H_0$ is true and ask how likely it would be to see a difference at least as large as the one we observed.

We will simulate 10,000 samples by drawing binomial outcomes for each group at its observed proportion (a parametric bootstrap). For each sample, we measure the difference between `P_higher_3pt_win` and `P_not_higher_win`.

In [None]:
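p_diffs = []
for _ in range(10000):
    # draw one 0/1 outcome per game, using each group's observed proportion
    higher_3pt_win = np.random.binomial(1,p_obs_treatment,treatment.shape[0])
    not_higher_win = np.random.binomial(1,p_obs_control,control.shape[0])
    # difference in simulated proportions for this sample
    p_diffs.append(higher_3pt_win.mean() - not_higher_win.mean())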
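
The loop above draws each group at its own observed proportion. A variant that simulates under $H_0$ would instead draw both groups at a single pooled proportion, so that the simulated differences are centered on zero. The following is a minimal sketch of that variant, not the notebook's method. It plugs in the group sizes and proportions quoted in the text (4826 and 56.75%, 2045 and 59.8%) rather than reading `games.csv`, and the names `p_pooled`, `null_diffs` and `p_value_null` are made up for illustration.

```python
import numpy as np

# Values quoted in the text above (assumed, not recomputed from games.csv)
n_t, p_t = 4826, 0.5675  # treatment group size and observed proportion
n_c, p_c = 2045, 0.598   # control group size and observed proportion

# Under H0 both groups share one proportion; estimate it by pooling the groups
p_pooled = (n_t * p_t + n_c * p_c) / (n_t + n_c)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_sims = 10_000

# Simulated success counts per group, converted to proportions
sim_t = rng.binomial(n_t, p_pooled, size=n_sims) / n_t
sim_c = rng.binomial(n_c, p_pooled, size=n_sims) / n_c
null_diffs = sim_t - sim_c

obs_diff = p_t - p_c
# One-sided p-value for H1 (difference > 0): share of null draws at least as large as observed
p_value_null = (null_diffs >= obs_diff).mean()
print(p_value_null)
```

Since the observed difference is negative while $H_1$ looks for a positive one, this one-sided p-value lands far above 5%.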

`p_obs_diff` is the difference between `P_higher_3pt_win` and `P_not_higher_win` in the observed data above.

In [None]:
p_obs_diff = p_obs_treatment - p_obs_control
p_obs_diff

In [None]:
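p_diffs = np.array(p_diffs)
# distribution of simulated differences, with the observed difference marked in red
plt.hist(p_diffs)
plt.axvline(x = p_obs_diff,c='red')
plt.xlabel('difference in proportions')
plt.ylabel('number of simulated samples')
plt.show()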

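The p-value computed below is about 0.5034, which is larger than the 5% Type I error rate, so we **fail to reject** the null hypothesis. Because the simulated differences are drawn at the observed proportions, they are centered on `p_obs_diff` itself, so about half of them fall above it. A value near 0.5 is therefore what this construction produces; it is not the probability that either kind of team wins a game.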

In [None]:
# share of simulated differences that exceed the observed difference
p_value = (p_diffs > p_obs_diff).mean()
p_value
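
As a cross-check that does not rely on simulation, a pooled two-proportion z-test can be computed directly from the group sizes and proportions quoted in the text. This is a sketch under the usual normal approximation, not part of the original analysis. The inputs are the rounded figures from the text, and the names `se`, `z` and `p_value_z` are made up for illustration.

```python
import math

# Values quoted in the text above (assumed, not recomputed from games.csv)
n_t, p_t = 4826, 0.5675  # treatment group size and observed proportion
n_c, p_c = 2045, 0.598   # control group size and observed proportion

# Pooled proportion and standard error of the difference under H0
p_pooled = (n_t * p_t + n_c * p_c) / (n_t + n_c)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_t + 1 / n_c))

# z statistic for the observed difference in proportions
z = (p_t - p_c) / se

# One-sided p-value for H1 (difference > 0): upper tail of the standard normal
p_value_z = 0.5 * math.erfc(z / math.sqrt(2))
print(z, p_value_z)
```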