## Hypothesis Testing Assignment

In [1]:
#Importing necessary packages
import pandas as pd
import numpy as np
# package with hypothesis tests
import scipy.stats as st

### Data

You can download the data from [**here**](https://drive.google.com/file/d/0Bz9_0VdXvv9bUUNlUTVrMF9VcVU/view?usp=sharing&resourcekey=0-16O9Fc5eaJH99-M7AHqHOg). The data contains results of all NBA games from seasons 2013/2014 to 2015/2016.

In [2]:
df = pd.read_csv('nba_games_13_15.csv', sep=';')

In [3]:
df.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,22015,1610612750,MIN,Minnesota Timberwolves,21501226,2016-04-13,MIN vs. NOP,W,240,144,...,0.826,5,38,43,41,14,8,13,20,35.0
1,22015,1610612749,MIL,Milwaukee Bucks,21501225,2016-04-13,MIL vs. IND,L,240,92,...,0.846,7,36,43,23,8,3,15,15,-5.0
2,22015,1610612738,BOS,Boston Celtics,21501217,2016-04-13,BOS vs. MIA,W,240,98,...,0.864,10,29,39,20,7,3,7,20,10.0
3,22015,1610612747,LAL,Los Angeles Lakers,21501228,2016-04-13,LAL vs. UTA,W,239,101,...,0.867,8,39,47,19,6,3,13,17,5.0
4,22015,1610612739,CLE,Cleveland Cavaliers,21501220,2016-04-13,CLE vs. DET,L,265,110,...,0.733,8,35,43,21,4,7,10,23,-2.0


In [4]:
df.columns

Index(['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'PTS', 'FGM', 'FGA', 'FG_PCT',
       'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB',
       'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PLUS_MINUS'],
      dtype='object')

--------------
### Task 1
Split the data into **3** separate dataframes for each NBA season!

In [4]:
# Split the DataFrame based on 'SEASON_ID'
df_2013 = df[df['SEASON_ID'] == 22013]
df_2014 = df[df['SEASON_ID'] == 22014]
df_2015 = df[df['SEASON_ID'] == 22015]
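
The same split can also be written with `groupby`, which avoids repeating the filter for each season. This is only a sketch on a hypothetical miniature table (`toy` and its values are made up for illustration); the `SEASON_ID` codes follow the ones used above.

```python
import pandas as pd

# Hypothetical miniature of the games table; only the columns needed here.
toy = pd.DataFrame({
    'SEASON_ID': [22013, 22013, 22014, 22015, 22015],
    'PTS': [87, 107, 95, 144, 92],
})

# One DataFrame per season, keyed by SEASON_ID.
seasons = {season_id: group for season_id, group in toy.groupby('SEASON_ID')}

for season_id, group in seasons.items():
    print(season_id, len(group))
```

With the real data, `seasons[22015]` would play the role of `df_2015`.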

---------------
### Task 2
Test the hypothesis that the offensive production stats of the Cleveland Cavaliers and Golden State Warriors (the teams that met in the finals that year) were from the same distribution in the 2015/2016 season.

Offensive production refers to two variables: **PTS (Points)** and **FG_PCT (Field Goal Percentage)**. We will need to do two separate hypothesis tests, one for each variable.

- Null Hypothesis (H0): The offensive production stats (PTS and FG_PCT) of the Cleveland Cavaliers and Golden State Warriors come from the same distribution in the 2015 season.
- Alternative Hypothesis (H1): The offensive production stats (PTS and FG_PCT) of the Cleveland Cavaliers and Golden State Warriors do not come from the same distribution in the 2015 season.

In [7]:
from scipy.stats import mannwhitneyu

# Filter data for the two teams
cavs = df_2015[df_2015['TEAM_NAME'] == 'Cleveland Cavaliers']
warriors = df_2015[df_2015['TEAM_NAME'] == 'Golden State Warriors']

# Perform Mann-Whitney U test for 'PTS'
stat, p = mannwhitneyu(cavs['PTS'], warriors['PTS'])
print('Statistics=%.3f, p=%.3f' % (stat, p))

# Interpretation
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0) for PTS')
else:
    print('Different distribution (reject H0) for PTS')

# Perform Mann-Whitney U test for 'FG_PCT'
stat, p = mannwhitneyu(cavs['FG_PCT'], warriors['FG_PCT'])
print('Statistics=%.3f, p=%.3f' % (stat, p))

# Interpretation
if p > alpha:
    print('Same distribution (fail to reject H0) for FG_PCT')
else:
    print('Different distribution (reject H0) for FG_PCT')

Statistics=1683.500, p=0.000
Different distribution (reject H0) for PTS
Statistics=2452.500, p=0.003
Different distribution (reject H0) for FG_PCT


Interpretation of the results:

1. For 'PTS' (Points):
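   - The Mann-Whitney U statistic is 1683.500. In SciPy this is the U statistic of the first sample (the Cavaliers): the number of Cavaliers/Warriors game pairs in which the Cavaliers' value is higher, with ties counted as one half. It is not a sum of ranks, although it can be computed from the first sample's rank sum.
   - The p-value prints as 0.000, i.e. it is below 0.0005 after rounding to three decimals, which is less than the alpha level of 0.05. We therefore reject the null hypothesis that the 'PTS' for the Cleveland Cavaliers and Golden State Warriors come from the same distribution in the 2015/2016 season. The data suggest that the 'PTS' for the two teams come from different distributions.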

2. For 'FG_PCT' (Field Goal Percentage):
   - The Mann-Whitney U statistic is 2452.500.
   - The p-value is 0.003, which is also less than the alpha level of 0.05. This means rejecting the null hypothesis that the 'FG_PCT' for the Cleveland Cavaliers and Golden State Warriors come from the same distribution in the 2015 season. The data suggests that the 'FG_PCT' for the two teams come from different distributions.

In conclusion, both the 'PTS' and 'FG_PCT' offensive production stats for the Cleveland Cavaliers and Golden State Warriors appear to come from different distributions in the 2015 season, according to the data and the Mann-Whitney U test.
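
To make the meaning of the U statistic concrete, the sketch below recomputes it from rank sums on two small made-up samples (the numbers are illustrative, not taken from the dataset). It relies on the standard identities U1 = R1 - n1(n1+1)/2 and U1 + U2 = n1·n2, where R1 is the rank sum of the first sample in the pooled data.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

# Made-up point totals for two hypothetical teams (not from the dataset).
x = np.array([110, 98, 105, 120, 101])
y = np.array([95, 99, 90, 102, 97, 93])

u1, p = mannwhitneyu(x, y, alternative='two-sided')

# Rank the pooled sample, then take the rank sum of the first sample.
ranks = rankdata(np.concatenate([x, y]))
r1 = ranks[:len(x)].sum()
u1_from_ranks = r1 - len(x) * (len(x) + 1) / 2

# U for the second sample follows from U1 + U2 = n1 * n2.
u2 = len(x) * len(y) - u1
print(u1, u1_from_ranks, u2)
```

A value of U1 close to n1·n2/2 would indicate that neither sample tends to rank above the other.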

-----------------
### Task 3
Test the hypothesis that the number of points (PTS) scored by Cleveland Cavaliers changed significantly after the head coach changed in the 2015/2016 season.

- **Coach Blatt was fired on 24th of Jan, 2016**. 

Use the data from seasons 2014/2015 and 2015/2016 - those are years when Cleveland was coached by Blatt. 

**We have two possible solutions to try here:**
- Take the same amount of games from before and after and try t-test.
- Take all the games from before and after and look for the right test to compare two samples with different sizes. (You will need to go through the scipy documentation or google to figure out what kind of test is required.)

In [7]:
from scipy.stats import ttest_ind

# Filter data for Cleveland Cavaliers; .copy() avoids modifying a view of df
cavs = df[(df['TEAM_NAME'] == 'Cleveland Cavaliers') & df['SEASON_ID'].isin([22014, 22015])].copy()

# Convert GAME_DATE to datetime
cavs['GAME_DATE'] = pd.to_datetime(cavs['GAME_DATE'])

# Define coach change date as Timestamp
coach_change_date = pd.to_datetime('2016-01-24')

# Split data based on the coach change date
before_coach_change = cavs[cavs['GAME_DATE'] < coach_change_date]
after_coach_change = cavs[cavs['GAME_DATE'] >= coach_change_date]

# Get the same number of games from before and after the coach change
min_games = min(len(before_coach_change), len(after_coach_change))
before_coach_change = before_coach_change.sort_values('GAME_DATE', ascending=False).head(min_games)
after_coach_change = after_coach_change.sort_values('GAME_DATE').head(min_games)

# Perform t-test
stat, p = ttest_ind(before_coach_change['PTS'], after_coach_change['PTS'])
print('Statistics=%.3f, p=%.3f' % (stat, p))

# Interpretation
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0) for PTS')
else:
    print('Different distribution (reject H0) for PTS')

Statistics=-2.911, p=0.005
Different distribution (reject H0) for PTS


The `Statistics` value (-2.911) is the test statistic from the t-test. A negative value indicates that the first sample's mean (before the coach change) is less than the second sample's mean (after the coach change).

The `p` value (0.005) is the probability of observing a test statistic as extreme as the one obtained (or more extreme) if the null hypothesis (H0) is true. 

In this case, the null hypothesis (H0) is that the mean points scored (PTS) before and after the coach change are equal. The t-test compares means, not whole distributions.

Because the p-value is less than the significance level (0.05), the null hypothesis is rejected. This means that there is a statistically significant difference in the mean points scored before and after the coach change.

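So, the results suggest that the Cleveland Cavaliers scored significantly more points per game after the coach change than in the same number of games before it. The test alone does not show that the coach change caused the difference.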
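
The second option from the task, using all games with unequal sample sizes, could be approached with Welch's t-test, which does not assume equal variances or equal group sizes. The sketch below uses synthetic numbers as stand-ins for the per-game points (the means, spreads and sizes are assumptions for illustration only); `equal_var=False` is the `ttest_ind` argument that selects Welch's variant.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-game points; sizes deliberately unequal.
before = rng.normal(loc=100, scale=10, size=120)
after = rng.normal(loc=104, scale=12, size=40)

# equal_var=False selects Welch's t-test instead of Student's t-test.
stat, p = ttest_ind(before, after, equal_var=False)
print('Statistics=%.3f, p=%.3f' % (stat, p))
```

With the real data, `before_coach_change['PTS']` and `after_coach_change['PTS']` (before trimming to `min_games`) would take the place of the two synthetic arrays.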

----------------


### Task 4
Download [**the similar dataset**](https://drive.google.com/file/d/0Bz9_0VdXvv9bRHhuRTI1aXBQcTA/view?usp=sharing&resourcekey=0-jGIaWyk0bXyECNSzB3x38w) with scores from playoff games in 2016.

In [6]:
df_2016 = pd.read_csv('nba_playoff_16.csv', sep=';')
df_2016.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,42015,1610612739,CLE,Cleveland Cavaliers,41500407,2016-06-19,CLE @ GSW,W,241,93,...,0.84,9,39,48,17,7,6,11,15,4.0
1,42015,1610612744,GSW,Golden State Warriors,41500407,2016-06-19,GSW vs. CLE,L,239,89,...,0.769,7,32,39,22,7,5,10,23,-4.0
2,42015,1610612744,GSW,Golden State Warriors,41500406,2016-06-16,GSW @ CLE,L,238,101,...,0.69,9,26,35,19,5,3,14,25,-14.0
3,42015,1610612739,CLE,Cleveland Cavaliers,41500406,2016-06-16,CLE vs. GSW,W,240,115,...,0.781,8,37,45,24,12,7,10,25,14.0
4,42015,1610612739,CLE,Cleveland Cavaliers,41500405,2016-06-13,CLE @ GSW,W,241,112,...,0.609,8,33,41,15,11,9,16,22,15.0


------------
### Task 5
Test the hypothesis that the **number of blocks (BLK)** comes from the same distribution in both the NBA playoffs and the NBA regular season of 2015/2016 for the **Toronto Raptors**. 

- We will be working with two samples with different sizes again.

In [9]:
from scipy.stats import shapiro, levene

# Check for normality in df_2015
stat, p = shapiro(df_2015['BLK'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('df_2015 data looks Gaussian (fail to reject H0)')
else:
    print('df_2015 data does not look Gaussian (reject H0)')

# Check for normality in df_2016
stat, p = shapiro(df_2016['BLK'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('df_2016 data looks Gaussian (fail to reject H0)')
else:
    print('df_2016 data does not look Gaussian (reject H0)')

# Check for equal variances
stat, p = levene(df_2015['BLK'], df_2016['BLK'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Samples have equal variances (fail to reject H0)')
else:
    print('Samples do not have equal variances (reject H0)')

Statistics=0.965, p=0.000
df_2015 data does not look Gaussian (reject H0)
Statistics=0.960, p=0.000
df_2016 data does not look Gaussian (reject H0)
Statistics=0.007, p=0.933
Samples have equal variances (fail to reject H0)


The Shapiro-Wilk test was performed on the 'BLK' column of the df_2015 DataFrame to check if the data is normally distributed. The test statistic is 0.965 and the p-value is 0.000. Since the p-value is less than 0.05, we reject the null hypothesis (H0) that the data is normally distributed. Therefore, the 'BLK' data in df_2015 does not follow a normal distribution.

The same test was performed on the 'BLK' column of the df_2016 DataFrame. The test statistic is 0.960 and the p-value is 0.000. Again, since the p-value is less than 0.05, we reject the null hypothesis (H0) that the data is normally distributed. Therefore, the 'BLK' data in df_2016 does not follow a normal distribution either.

Levene's test was performed to check if the variances of the 'BLK' data in the df_2015 and df_2016 DataFrames are equal. The test statistic is 0.007 and the p-value is 0.933. Since the p-value is greater than 0.05, we fail to reject the null hypothesis (H0) that the variances are equal. There is therefore no evidence that the variances of the 'BLK' data in df_2015 and df_2016 differ.

In summary, the 'BLK' data in both df_2015 and df_2016 are not normally distributed, and there is no evidence of unequal variances, so a non-parametric test is a reasonable choice.

In [10]:
from scipy.stats import mannwhitneyu

# Perform Mann-Whitney U test
stat, p = mannwhitneyu(df_2015['BLK'], df_2016['BLK'])

print('Statistics=%.3f, p=%.3f' % (stat, p))
# Interpretation
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0) for BLK')
else:
    print('Different distribution (reject H0) for BLK')

Statistics=211177.500, p=0.968
Same distribution (fail to reject H0) for BLK


The Mann-Whitney U test was performed on the 'BLK' column of the df_2015 and df_2016 DataFrames to check if the two independent samples were selected from populations having the same distribution.

The test statistic is 211177.500 and the p-value is 0.968. The p-value is greater than the significance level of 0.05, so we fail to reject the null hypothesis (H0).

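This means that there is not enough evidence to conclude that the distributions of 'BLK' in df_2015 and df_2016 are different. It is not proof that the two distributions are identical. Note also that df_2015 and df_2016 contain the games of all teams, so the tests in this task describe the whole league; to answer the question for the Toronto Raptors, both DataFrames would first have to be filtered on TEAM_NAME.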
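
A minimal sketch of a Raptors-only comparison is given below. It runs on synthetic data, since the result on the real files is not known here: `toy_games`, the Poisson rate and the number of playoff games are all assumptions made for illustration. Only the `TEAM_NAME` and `BLK` column names come from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

def toy_games(n_games, teams):
    """Build a synthetic stand-in for the games table (not real NBA data)."""
    return pd.DataFrame({
        'TEAM_NAME': np.repeat(teams, n_games),
        'BLK': rng.poisson(lam=5, size=n_games * len(teams)),
    })

teams = ['Toronto Raptors', 'Boston Celtics']
regular = toy_games(82, teams)   # plays the role of df_2015
playoffs = toy_games(20, teams)  # plays the role of df_2016

# Keep only the Raptors' rows before testing.
raptors_regular = regular.loc[regular['TEAM_NAME'] == 'Toronto Raptors', 'BLK']
raptors_playoffs = playoffs.loc[playoffs['TEAM_NAME'] == 'Toronto Raptors', 'BLK']

stat, p = mannwhitneyu(raptors_regular, raptors_playoffs, alternative='two-sided')
print('n_regular=%d, n_playoffs=%d' % (len(raptors_regular), len(raptors_playoffs)))
print('Statistics=%.3f, p=%.3f' % (stat, p))
```

The Mann-Whitney U test handles the unequal sample sizes directly, so no trimming of the larger sample is needed.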


-----------------
### Task 6
Test the hypothesis that the number of points (PTS) scored by Cleveland Cavaliers is equally distributed for all 3 seasons. 

- In this case, we need a hypothesis test that compares more than 2 distributions at the same time. (You will need to go through the scipy documentation or google to figure out what kind of test is required.)

In [11]:
# Check for normality in df_2013
stat, p = shapiro(df_2013['PTS'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('df_2013 PTS data looks Gaussian (fail to reject H0)')
else:
    print('df_2013 PTS data does not look Gaussian (reject H0)')

# Check for normality in df_2014
stat, p = shapiro(df_2014['PTS'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('df_2014 PTS data looks Gaussian (fail to reject H0)')
else:
    print('df_2014 PTS data does not look Gaussian (reject H0)')

# Check for normality in df_2015
stat, p = shapiro(df_2015['PTS'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('df_2015 PTS data looks Gaussian (fail to reject H0)')
else:
    print('df_2015 PTS data does not look Gaussian (reject H0)')

Statistics=0.997, p=0.000
df_2013 PTS data does not look Gaussian (reject H0)
Statistics=0.997, p=0.000
df_2014 PTS data does not look Gaussian (reject H0)
Statistics=0.998, p=0.001
df_2015 PTS data does not look Gaussian (reject H0)


The Shapiro-Wilk test was performed on the 'PTS' column of the df_2013 DataFrame to check if the data is normally distributed. The test statistic is 0.997 and the p-value is 0.000. Since the p-value is less than 0.05, we reject the null hypothesis (H0) that the data is normally distributed. Therefore, the 'PTS' data in df_2013 does not follow a normal distribution.

The same test was performed on the 'PTS' column of the df_2014 DataFrame. The test statistic is 0.997 and the p-value is 0.000. Again, since the p-value is less than 0.05, we reject the null hypothesis (H0) that the data is normally distributed. Therefore, the 'PTS' data in df_2014 does not follow a normal distribution either.

The Shapiro-Wilk test was also performed on the 'PTS' column of the df_2015 DataFrame. The test statistic is 0.998 and the p-value is 0.001. Since the p-value is less than 0.05, we reject the null hypothesis (H0) that the data is normally distributed. Therefore, the 'PTS' data in df_2015 does not follow a normal distribution.

In summary, the 'PTS' data in all three DataFrames (df_2013, df_2014, df_2015) are not normally distributed.

In [8]:
from scipy.stats import kruskal

# Perform Kruskal-Wallis H test
stat, p = kruskal(df_2013['PTS'], df_2014['PTS'], df_2015['PTS'])

print('Statistics=%.3f, p=%.3f' % (stat, p))
# Interpretation
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0) for PTS')
else:
    print('Different distribution (reject H0) for PTS')

Statistics=62.612, p=0.000
Different distribution (reject H0) for PTS


The Kruskal-Wallis H test was performed on the 'PTS' column of the df_2013, df_2014, and df_2015 DataFrames to check if the three independent samples were selected from populations having the same distribution.

The test statistic is 62.612 and the p-value is 0.000. The p-value is less than the significance level of 0.05, so we reject the null hypothesis (H0).

This means that there is enough evidence to conclude that the distributions of 'PTS' in df_2013, df_2014, and df_2015 are not all the same. Note that these DataFrames contain the games of all teams, so this result describes league-wide scoring; to answer the question for the Cleveland Cavaliers, each DataFrame would first have to be filtered on TEAM_NAME.
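
A Cavaliers-only version of the Kruskal-Wallis test could look like the sketch below. The table `games` is synthetic (normally distributed points with an assumed mean and spread), so the printed numbers say nothing about the real seasons; only the column names and `SEASON_ID` codes are taken from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(2)

# Synthetic stand-in for the full games table (not real NBA data).
season_ids = [22013, 22014, 22015]
teams = ['Cleveland Cavaliers', 'Golden State Warriors']
rows = [
    {'SEASON_ID': s, 'TEAM_NAME': t, 'PTS': int(pts)}
    for s in season_ids
    for t in teams
    for pts in rng.normal(loc=100, scale=11, size=82).round()
]
games = pd.DataFrame(rows)

# Restrict to one team, then collect one PTS sample per season.
cavs = games[games['TEAM_NAME'] == 'Cleveland Cavaliers']
samples = [group['PTS'].to_numpy() for _, group in cavs.groupby('SEASON_ID')]

stat, p = kruskal(*samples)
print('Statistics=%.3f, p=%.3f' % (stat, p))
```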

#### Follow Up
**Between which seasons can we see the significant difference?**

+ Unfortunately, a test of this kind (like ANOVA) does not tell us which groups differ, so further post-hoc tests need to be run.
+ Note: LeBron James came back to the Cleveland Cavaliers prior to season 2014/2015. We can use this fact to interpret our results correctly.

In [11]:
from scikit_posthocs import posthoc_dunn

# Create a DataFrame where each group is a separate column
data = pd.DataFrame({'2013': df_2013['PTS'], '2014': df_2014['PTS'], '2015': df_2015['PTS']})

# Perform Dunn's test
posthoc = posthoc_dunn(a=data, val_col='value_column_name', group_col='group_column_name')

print(posthoc)

ValueError: Specify correct column names using `group_col` and `val_col` args

In [13]:
print(data)

       2013  2014   2015
0       NaN   NaN  144.0
1       NaN   NaN   92.0
2       NaN   NaN   98.0
3       NaN   NaN  101.0
4       NaN   NaN  110.0
...     ...   ...    ...
7375   87.0   NaN    NaN
7376  107.0   NaN    NaN
7377  116.0   NaN    NaN
7378   95.0   NaN    NaN
7379   97.0   NaN    NaN

[7380 rows x 3 columns]
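
The printout explains the error. The three Series keep their original row indices, so the wide DataFrame is mostly NaN, and `'value_column_name'` and `'group_column_name'` are placeholders that match no column in `data`. When a DataFrame is passed, `posthoc_dunn` expects long-format data, with one column of values and one column of group labels named through `val_col` and `group_col`.

The sketch below builds such a long-format table and answers the follow-up question with a different but related procedure: pairwise Mann-Whitney U tests with a Bonferroni correction. This is a substitute that needs only SciPy, not Dunn's test itself, and the data are synthetic (the means, spread and the names `long_df` and `season` are assumptions for illustration).

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)

# Synthetic long-format data: one row per game, one column naming the group.
long_df = pd.DataFrame({
    'season': np.repeat(['2013', '2014', '2015'], 82),
    'PTS': np.concatenate([
        rng.normal(98, 10, 82),
        rng.normal(103, 10, 82),
        rng.normal(104, 10, 82),
    ]).round(),
})

groups = {name: g['PTS'].to_numpy() for name, g in long_df.groupby('season')}
pairs = list(combinations(sorted(groups), 2))

# Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1.
adjusted = {}
for a, b in pairs:
    _, p = mannwhitneyu(groups[a], groups[b], alternative='two-sided')
    adjusted[(a, b)] = min(1.0, p * len(pairs))

for pair, p_adj in adjusted.items():
    print(pair, 'adjusted p=%.3f' % p_adj)
```

A table shaped like `long_df` is also what one would pass to `posthoc_dunn`, with `val_col='PTS'` and `group_col='season'`.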
