## Hypothesis Testing Assignment

In [1]:
#Importing necessary packages
import pandas as pd
import numpy as np
# package with hypothesis tests
import scipy.stats as st

### Data

You can download the data from [**here**](https://drive.google.com/file/d/19b9lHlkixZhs8yka8zV0QFieao66dUcY/view?usp=sharing). The data contains results of all NBA games from seasons 2013/2014 to 2015/2016.

In [278]:
# Load csv file

nba_csv = pd.read_csv('nba_games_2013_2015.csv', sep = ';')
nba_csv.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,22015,1610612750,MIN,Minnesota Timberwolves,21501226,2016-04-13,MIN vs. NOP,W,240,144,...,0.826,5,38,43,41,14,8,13,20,35.0
1,22015,1610612749,MIL,Milwaukee Bucks,21501225,2016-04-13,MIL vs. IND,L,240,92,...,0.846,7,36,43,23,8,3,15,15,-5.0
2,22015,1610612738,BOS,Boston Celtics,21501217,2016-04-13,BOS vs. MIA,W,240,98,...,0.864,10,29,39,20,7,3,7,20,10.0
3,22015,1610612747,LAL,Los Angeles Lakers,21501228,2016-04-13,LAL vs. UTA,W,239,101,...,0.867,8,39,47,19,6,3,13,17,5.0
4,22015,1610612739,CLE,Cleveland Cavaliers,21501220,2016-04-13,CLE vs. DET,L,265,110,...,0.733,8,35,43,21,4,7,10,23,-2.0


--------------
### Task 1
Split the data into **3** separate dataframes, one for each NBA season.

In [199]:
nba_csv.groupby('SEASON_ID').size()

SEASON_ID
22013    2460
22014    2460
22015    2460
dtype: int64

In [200]:
# Split data into 3 dataframes based on NBA season

df_s22013 = nba_csv[nba_csv['SEASON_ID'] == 22013]
df_s22014 = nba_csv[nba_csv['SEASON_ID'] == 22014]
df_s22015 = nba_csv[nba_csv['SEASON_ID'] == 22015]
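
The three filters above can also be written once with `groupby`. The following is a minimal sketch, assuming a small made-up frame named `games` in place of `nba_csv` (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Made-up stand-in for nba_csv; only the SEASON_ID column matters here.
games = pd.DataFrame({
    "SEASON_ID": [22013, 22013, 22014, 22015, 22015, 22015],
    "PTS": [101, 95, 110, 99, 120, 104],
})

# One dataframe per season, keyed by SEASON_ID.
season_frames = {season_id: frame for season_id, frame in games.groupby("SEASON_ID")}

for season_id, frame in season_frames.items():
    print(season_id, len(frame))
```

A dictionary keeps the per-season frames together, which is convenient if the same test has to be repeated for every season.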

---------------
### Task 2
Test the hypothesis that the offensive production stats of the Cleveland Cavaliers and Golden State Warriors (the teams that met in the finals that year) were from the same distribution in the 2015/2016 season.

Offensive production refers to two variables: **PTS (Points)** and **FG_PCT (Field Goal Percentage)**. We will need to do two separate hypothesis tests, one for each variable.

In [201]:
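# Extract PTS data from both teams
# nba_csv covers all three seasons, so each series below holds 246 games;
# filter df_s22015 instead to restrict the test to the 2015/2016 season.

CLE_PTS = nba_csv[nba_csv['TEAM_NAME'] == 'Cleveland Cavaliers']['PTS']
GSW_PTS = nba_csv[nba_csv['TEAM_NAME'] == 'Golden State Warriors']['PTS']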

In [202]:
def check_hnull(result, alpha = 0.05):
    # result is the (statistic, p-value) pair returned by a scipy.stats test
    pvalue = result[1]
    if pvalue <= alpha:
        print(f'Reject H null, p-value: {pvalue:.3} <= alpha: {alpha}')
    else:
        print(f'Fail to reject H null, p-value: {pvalue:.3} > alpha: {alpha}')
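
`check_hnull` reads the p-value by position. The SciPy tests used in this notebook return a result object that also exposes it by name, which can be easier to read. The following is a minimal sketch, assuming synthetic normal data rather than the NBA data; `describe_test` is a hypothetical helper, not part of the assignment:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=10, size=200)  # synthetic data

result = st.normaltest(sample)

# The result unpacks like a (statistic, p-value) tuple ...
statistic, pvalue = result
# ... and exposes the same values as named attributes.
assert pvalue == result.pvalue


def describe_test(result, alpha=0.05):
    """Return the decision as a string instead of printing it."""
    decision = "Reject" if result.pvalue <= alpha else "Fail to reject"
    return f"{decision} H null, p-value: {result.pvalue:.3}"


print(describe_test(result))
```

Returning a string rather than printing makes the helper easier to reuse, for example when collecting the decisions for several teams in a list.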

In [238]:
def check_counts(a, b):
    if len(a) == len(b):
        print(f'Same counts: {len(a)}')
    else:
        print(f'Different counts: a = {len(a)}, b = {len(b)}')

In [204]:
# Test for normality

check_hnull(st.normaltest(CLE_PTS))
check_hnull(st.normaltest(GSW_PTS))

Reject H null, p-value: 0.022 <= alpha: 0.05
Fail to reject H null, p-value: 0.506 > alpha: 0.05


In [235]:
check_counts(CLE_PTS, GSW_PTS)

Same counts: 246


In [205]:
# Independent two-sample t-test
# H null: both teams have the same mean PTS

check_hnull(st.ttest_ind(CLE_PTS, GSW_PTS))

Reject H null, p-value: 6.04e-12 <= alpha: 0.05


In [214]:
# Extract FG_PCT data from both teams (all three seasons in nba_csv)

CLE_FG_PCT = nba_csv[nba_csv['TEAM_NAME'] == 'Cleveland Cavaliers']['FG_PCT']
GSW_FG_PCT = nba_csv[nba_csv['TEAM_NAME'] == 'Golden State Warriors']['FG_PCT']

In [215]:
# Test for normality

check_hnull(st.normaltest(CLE_FG_PCT))
check_hnull(st.normaltest(GSW_FG_PCT))


In [236]:
check_counts(CLE_FG_PCT, GSW_FG_PCT)

Same counts: 246


In [216]:
# Independent two-sample t-test
# H null: both teams have the same mean FG_PCT

check_hnull(st.ttest_ind(CLE_FG_PCT, GSW_FG_PCT))

Reject H null, p-value: 2.7e-06 <= alpha: 0.05
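
The t-test compares means and assumes roughly normal data. A test that addresses "same distribution" more directly is the two-sample Kolmogorov-Smirnov test, `scipy.stats.ks_2samp`. The following is a minimal sketch, assuming synthetic samples `team_a` and `team_b` standing in for the two teams:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)
# Synthetic stand-ins for two teams' per-game values (not real NBA data).
team_a = rng.normal(loc=100, scale=10, size=246)
team_b = rng.normal(loc=108, scale=10, size=246)

# H null: both samples are drawn from the same continuous distribution.
ks_result = st.ks_2samp(team_a, team_b)
print(f"KS statistic: {ks_result.statistic:.3f}, p-value: {ks_result.pvalue:.3g}")
```

The Kolmogorov-Smirnov test is defined for continuous distributions, so with integer-valued data such as PTS the many ties make its p-value approximate.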


-----------------
### Task 3
Test the hypothesis that the number of points (PTS) scored by Cleveland Cavaliers changed significantly after the head coach changed in the 2015/2016 season.

- **Coach Blatt was fired on 24th of Jan, 2016**. 

Use the data from seasons 2014/2015 and 2015/2016, the years when Cleveland was coached by Blatt.

**We have two possible solutions to try here:**
- Take the same number of games from before and after and try a t-test.
- Take all the games from before and after and look for the right test to compare two samples with different sizes. (You will need to go through the scipy documentation or google to figure out what kind of test is required.)
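
The first option can be sketched as follows, assuming synthetic point totals and taking the games closest to the coaching change on each side. Whether to take the closest games or a random subset is a design choice the task leaves open.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(1)
# Synthetic, chronologically ordered point totals (not real NBA data).
before = pd.Series(rng.normal(loc=101, scale=10, size=124).round())
after = pd.Series(rng.normal(loc=106, scale=10, size=40).round())

# Use as many games as the smaller sample has.
n_games = min(len(before), len(after))
before_equal = before.tail(n_games)  # last n games before the change
after_equal = after.head(n_games)    # first n games after the change

result = st.ttest_ind(before_equal, after_equal)
print(len(before_equal), len(after_equal), f"p-value: {result.pvalue:.3g}")
```

Discarding games lowers the power of the test, which is one reason to prefer the second option when the samples differ a lot in size.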

In [233]:
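# PTS before the coaching change: 2015/2016 games before 2016-01-24 plus all of 2014/2015
# (GAME_DATE is an ISO-formatted string, so string comparison follows date order)
b_fire1 = df_s22015[(df_s22015['TEAM_NAME'] == 'Cleveland Cavaliers') & (df_s22015['GAME_DATE'] < '2016-01-24')]['PTS']
b_fire2 = df_s22014[(df_s22014['TEAM_NAME'] == 'Cleveland Cavaliers') & (df_s22014['GAME_DATE'] < '2016-01-24')]['PTS']
b_fire = pd.concat([b_fire1, b_fire2])

# PTS after the coaching change
a_fire = df_s22015[(df_s22015['TEAM_NAME'] == 'Cleveland Cavaliers') & (df_s22015['GAME_DATE'] >= '2016-01-24')]['PTS']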
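
The date filters above compare strings, which works because `GAME_DATE` is in `YYYY-MM-DD` format and therefore sorts chronologically. Converting the column to datetimes makes the intent explicit and is more robust if the format ever changes. The following is a minimal sketch, assuming a made-up frame with the same column names (the rows are illustrative, not real games):

```python
import pandas as pd

# Made-up rows using the same column names as the dataset.
games = pd.DataFrame({
    "TEAM_NAME": ["Cleveland Cavaliers"] * 4,
    "GAME_DATE": ["2016-01-20", "2016-01-23", "2016-01-25", "2016-02-01"],
    "PTS": [115, 83, 114, 111],
})
games["GAME_DATE"] = pd.to_datetime(games["GAME_DATE"])

cutoff = pd.Timestamp("2016-01-24")
is_cavs = games["TEAM_NAME"] == "Cleveland Cavaliers"

# Split the team's points at the cutoff date.
pts_before = games.loc[is_cavs & (games["GAME_DATE"] < cutoff), "PTS"]
pts_after = games.loc[is_cavs & (games["GAME_DATE"] >= cutoff), "PTS"]
print(len(pts_before), len(pts_after))
```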

In [234]:
# Test for normality

check_hnull(st.normaltest(b_fire))
check_hnull(st.normaltest(a_fire))

Fail to reject H null, p-value: 0.294 > alpha: 0.05
Fail to reject H null, p-value: 0.457 > alpha: 0.05


In [239]:
check_counts(b_fire, a_fire)

Different counts: a = 124, b = 40


In [253]:
# Welch's t-test (equal_var = False) for two samples of different sizes

check_hnull(st.ttest_ind(b_fire, a_fire, equal_var = False))

Reject H null, p-value: 0.00329 <= alpha: 0.05
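
Passing `equal_var = False` selects Welch's t-test, which does not assume equal variances. That assumption matters most when the sample sizes differ, as they do here. The following is a minimal sketch, assuming synthetic samples `x1` and `x2`, that computes the Welch statistic by hand and compares it with SciPy:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(7)
# Synthetic samples of unequal size (not real NBA data).
x1 = rng.normal(loc=101, scale=11, size=124)
x2 = rng.normal(loc=106, scale=9, size=40)

# Welch's t = (mean1 - mean2) / sqrt(var1/n1 + var2/n2),
# using sample variances (ddof=1).
standard_error = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t_manual = (x1.mean() - x2.mean()) / standard_error

t_scipy = st.ttest_ind(x1, x2, equal_var=False).statistic
print(f"manual: {t_manual:.4f}, scipy: {t_scipy:.4f}")
```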


----------------


### Task 4
Download [**the similar dataset**](https://drive.google.com/file/d/1jY57bAOZp9y83b4W2PAoSH1uFARaxxls/view?usp=sharing) with scores from playoff games in 2016.

In [257]:
nbapf = pd.read_csv('nba_playoff_games_2016.csv', sep = ';')
nbapf.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,42015,1610612739,CLE,Cleveland Cavaliers,41500407,2016-06-19,CLE @ GSW,W,241,93,...,0.84,9,39,48,17,7,6,11,15,4.0
1,42015,1610612744,GSW,Golden State Warriors,41500407,2016-06-19,GSW vs. CLE,L,239,89,...,0.769,7,32,39,22,7,5,10,23,-4.0
2,42015,1610612744,GSW,Golden State Warriors,41500406,2016-06-16,GSW @ CLE,L,238,101,...,0.69,9,26,35,19,5,3,14,25,-14.0
3,42015,1610612739,CLE,Cleveland Cavaliers,41500406,2016-06-16,CLE vs. GSW,W,240,115,...,0.781,8,37,45,24,12,7,10,25,14.0
4,42015,1610612739,CLE,Cleveland Cavaliers,41500405,2016-06-13,CLE @ GSW,W,241,112,...,0.609,8,33,41,15,11,9,16,22,15.0


------------
### Task 5
Test the hypothesis that the **number of blocks (BLK)** comes from the same distribution in both the NBA playoffs and the NBA regular season of 2015/2016 for the **Toronto Raptors**.

- We will be working with two samples with different sizes again.

In [290]:
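# nba_csv covers all three regular seasons, so nbar_blk holds 246 games;
# filter df_s22015 instead to restrict it to the 2015/2016 season.
nbar_blk = nba_csv[nba_csv['TEAM_NAME'] == 'Toronto Raptors']['BLK']
nbapf_blk = nbapf[nbapf['TEAM_NAME'] == 'Toronto Raptors']['BLK']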

In [271]:
# Test normality

check_hnull(st.normaltest(nbar_blk))
check_hnull(st.normaltest(nbapf_blk))

Reject H null, p-value: 0.000295 <= alpha: 0.05
Fail to reject H null, p-value: 0.993 > alpha: 0.05


In [273]:
# Compare sample sizes

check_counts(nbar_blk, nbapf_blk)

Different counts: a = 246, b = 20


In [275]:
# Welch's t-test (equal_var = False) for two samples of different sizes

check_hnull(st.ttest_ind(nbar_blk, nbapf_blk, equal_var = False))

Reject H null, p-value: 0.0333 <= alpha: 0.05
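
Blocks are small integer counts, the regular-season sample failed the normality check above, and the playoff sample has only 20 games, so a rank-based test is a reasonable cross-check. The following is a minimal sketch, assuming synthetic Poisson counts in place of the Raptors data, using the Mann-Whitney U test:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
# Synthetic block counts (not real NBA data).
regular_blk = rng.poisson(lam=5.0, size=246)
playoff_blk = rng.poisson(lam=4.0, size=20)

# H null: a value drawn from one sample is equally likely to be larger
# or smaller than a value drawn from the other sample.
mw_result = st.mannwhitneyu(regular_blk, playoff_blk, alternative="two-sided")
print(f"U: {mw_result.statistic}, p-value: {mw_result.pvalue:.3g}")
```

Because the test uses ranks only, it makes no normality assumption and handles unequal sample sizes directly.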



-----------------
### Task 6
Test the hypothesis that the number of points (PTS) scored by Cleveland Cavaliers is equally distributed for all 3 seasons. 

- In this case, we need a hypothesis test that compares more than 2 distributions at the same time. (You will need to go through the scipy documentation or google to figure out what kind of test is required.)

In [284]:
PTS_2013 = df_s22013[df_s22013['TEAM_NAME'] == 'Cleveland Cavaliers']['PTS']
PTS_2014 = df_s22014[df_s22014['TEAM_NAME'] == 'Cleveland Cavaliers']['PTS']
PTS_2015 = df_s22015[df_s22015['TEAM_NAME'] == 'Cleveland Cavaliers']['PTS']

In [286]:
# Check normality

check_hnull(st.normaltest(PTS_2013))
check_hnull(st.normaltest(PTS_2014))
check_hnull(st.normaltest(PTS_2015))

Fail to reject H null, p-value: 0.233 > alpha: 0.05
Fail to reject H null, p-value: 0.453 > alpha: 0.05
Fail to reject H null, p-value: 0.311 > alpha: 0.05


In [287]:
# Check sample sizes
print(len(PTS_2013))
print(len(PTS_2014))
print(len(PTS_2015))

82
82
82


In [292]:
# One-way ANOVA
# H null: mean PTS is the same in all three seasons

check_hnull(st.f_oneway(PTS_2013, PTS_2014, PTS_2015))

Reject H null, p-value: 0.00309 <= alpha: 0.05
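
Besides normality, one-way ANOVA assumes similar variances across the groups. The following is a minimal sketch, assuming three synthetic samples, that checks this with Levene's test and also runs the Kruskal-Wallis test as a rank-based alternative:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(11)
# Three synthetic seasons of 82 games each (not real NBA data).
season_a = rng.normal(loc=98, scale=10, size=82)
season_b = rng.normal(loc=103, scale=10, size=82)
season_c = rng.normal(loc=104, scale=10, size=82)

# Levene's test, H null: all groups have equal variances.
levene_result = st.levene(season_a, season_b, season_c)
# Kruskal-Wallis test, H null: the population medians of all groups are equal.
kruskal_result = st.kruskal(season_a, season_b, season_c)

print(f"Levene p-value: {levene_result.pvalue:.3g}")
print(f"Kruskal-Wallis p-value: {kruskal_result.pvalue:.3g}")
```

If Levene's test rejects equal variances, the Kruskal-Wallis result is the safer one to report.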


#### Follow Up
**Between which seasons can we see the significant difference?**

+ Unfortunately, an ANOVA test does not tell us which seasons differ; further (post hoc) pairwise tests are needed.
+ Note: LeBron James returned to the Cleveland Cavaliers before the 2014/2015 season. We can use this fact to interpret our results correctly.
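
One simple way to follow up is to run pairwise Welch t-tests and apply a Bonferroni correction for the number of comparisons. The following is a minimal sketch, assuming synthetic per-season samples; the season labels and values are illustrative, not results from the dataset:

```python
from itertools import combinations

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(5)
# Synthetic per-season point totals (not real NBA data).
seasons = {
    "2013/2014": rng.normal(loc=98, scale=10, size=82),
    "2014/2015": rng.normal(loc=103, scale=10, size=82),
    "2015/2016": rng.normal(loc=104, scale=10, size=82),
}

pairs = list(combinations(seasons, 2))
alpha = 0.05
# Bonferroni correction: divide alpha by the number of comparisons.
alpha_corrected = alpha / len(pairs)

pairwise_pvalues = {}
for first, second in pairs:
    pvalue = st.ttest_ind(seasons[first], seasons[second], equal_var=False).pvalue
    pairwise_pvalues[(first, second)] = pvalue
    verdict = "differ" if pvalue <= alpha_corrected else "no significant difference"
    print(f"{first} vs {second}: p-value {pvalue:.3g} -> {verdict}")
```

Bonferroni is conservative but easy to verify by hand. Recent SciPy versions also provide `scipy.stats.tukey_hsd`, which is designed for this kind of post hoc comparison.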