<h1 align="left">**Hypothesis Testing** </h1>
<br>
<img src="../images/hypothesis.jpg" alt="Python" style="width: 600px;"/>

## John's Curiosity of OldTown 
***
 - Now, John is interested in dissecting the whole Real Estate Scene in OldTown
 
 - He's interested in seeing whether the prices of houses are different (on an average) when compared to the rest of Brooklyn 
 
 - John has just conjured up what is famously known as a "Hypothesis" 

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

## Hypothesis 
***
A statement that might be true, which can then be tested.

Example: Sam has a hypothesis that "large dogs are better at catching tennis balls than small dogs". We can test that hypothesis by having hundreds of different sized dogs try to catch tennis balls.

- The beauty of these hypotheses is that they can be TESTED! 

## Hypothesis Testing 
***
- Statistical hypothesis tests are based on a statement called the null hypothesis, which assumes nothing interesting is going on between the variables you are testing. 
 
 - Therefore, in John's case the Null Hypothesis is that:
     - "The Mean of House Prices in OldTown is not different from the Mean of House Prices all over Brooklyn"
     
 ### Why Null Hypothesis? 
 - The purpose of a hypothesis test is to determine whether the sample data are consistent with the null hypothesis.
 - If there is little evidence against the null hypothesis given the data, you fail to reject the null hypothesis. This is not the same as proving it true.
 - If the observed data would be unlikely under the null hypothesis, you reject the null in favor of the alternative hypothesis: that something interesting is going on.
 
 ### Alternative Hypothesis
 - This is nothing but the question you ask which kind of "opposes" the Null Hypothesis
 
 - Therefore, in John's case the Alternative Hypothesis is that:
     - "The Mean of House Prices in OldTown **IS** different from the Mean of House Prices all over Brooklyn"
     
 - The two hypotheses are mutually exclusive: only 1 of them can be right
 
 - In hypothesis testing we test a sample, with the goal of rejecting or failing to reject the null hypothesis, which is our assumption or default position. The test tells us whether the sample provides enough evidence against the null; it does not prove either hypothesis true.

## Important
***

### The null hypothesis is assumed true and statistical evidence is required to reject it in favor of a research or alternative hypothesis

 - We require a standard of evidence before we reject the null hypothesis (convict)


If we set a low standard, then we would increase the percentage of innocent people convicted (type I errors); however, we would also increase the percentage of guilty people convicted (correctly rejecting the null).


If we set a high standard, then we increase the percentage of innocent people let free (correctly failing to reject the null), while we would also increase the percentage of guilty people let free (type II errors).

<img src="../images/icon/Maths-Insight.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
## Math behind Hypothesis Testing
***
Once you have the null and alternative hypothesis in hand, you choose a significance level (often denoted by the Greek letter α). 

 - The significance level is a probability threshold that determines when you reject the null hypothesis.
 <img src="../images/sig.jpg" alt="Significance Level" style="width: 600px;float:left; margin-right:15px"/>


- From the sample we then calculate a "Test Statistic", a single number that measures how far the data fall from what the null hypothesis predicts
 ***
 <center><img src="../images/hyp1.png" alt="Drawing" style="width: 350px;"/></center> 

 - After carrying out a test, if the probability of getting a result at least as extreme as the one you observe purely by chance, assuming the null hypothesis is true, is lower than the significance level, you reject the null hypothesis in favor of the alternative. 
 
 - This probability of seeing a result as extreme or more extreme than the one observed is known as the *p-value*.
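
The test statistic and p-value described above can be sketched in a few lines of plain Python. This is a minimal illustration of a two-sided, one-sample z-test, not the notebook's own implementation: the `prices` list and the `pop_mean` of 133 are made-up numbers (in thousands), and `one_sample_z_test` is a hypothetical helper name.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_z_test(sample, pop_mean):
    """Return (z statistic, two-sided p-value) for a one-sample z-test."""
    n = len(sample)
    # Standard error of the sample mean, using the sample standard deviation
    standard_error = stdev(sample) / math.sqrt(n)
    # Test statistic: distance of the sample mean from the null mean, in standard errors
    z = (mean(sample) - pop_mean) / standard_error
    # Two-sided p-value: area in both tails of the standard normal curve beyond |z|
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p


# Hypothetical sample of 40 house prices (in thousands)
prices = [128, 135, 119, 142, 125, 131, 138, 122, 127, 133] * 4

z_stat, p_val = one_sample_z_test(prices, pop_mean=133)
print("z statistic:", round(z_stat, 3))
print("p-value:", round(p_val, 4))
print("reject null at alpha = 0.05:", p_val < 0.05)
```

The further the sample mean sits from the null mean (measured in standard errors), the larger the absolute value of the statistic and the smaller the p-value.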

## Interpretation of p-value
***
- The p-value is really not as complicated as people make it sound

- So now say that we have set a significance level (alpha) = 0.05
    - This means that if we see a p-value of less than 0.05, we reject our Null in favor of the Alternative 
    
    - What you have to understand is that, if the Null Hypothesis is true, the test statistic follows a known distribution (approximately Normal in our case) 
     - Just imagine 1 bell curve describing the data you would expect under the Null Hypothesis
     - Now imagine another bell curve which hypothetically describes your Alternative Hypothesis
     
     See below: 
        
 ***
 <center><img src="../images/pval1.png" alt="Drawing" style="width: 350px;"/></center> 

***
 - So what the p-value says is: it is the Probability of seeing data at least as extreme as what we actually observed, if the Null Hypothesis (bell curve 1!!) were true 
 
 - If it is less than 0.05 (our threshold) then we reject the Null Hypothesis

***
Why reject it though? 

 - BECAUSE OUR OBSERVED DATA IS REAL! NO ONE MADE IT UP! IT IS LEGIT DATA THAT IS OBSERVED AND NOT JUST FAKE! 
 - So if it is real, and such data would rarely show up under the Null Hypothesis (Bell Curve 1), the Null is a poor explanation for it and must be rejected! 
 
 - It now makes sense! P-values are cool again 
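
One way to build intuition for the p-value is to simulate the null bell curve directly. The sketch below is an illustration under assumed numbers, not part of John's analysis: the null mean of 180, standard deviation of 40, sample size of 25 and observed mean of 160 are all hypothetical. It draws many sample means from the null distribution and counts how often they land at least as far from the null mean as the observed one.

```python
import random
from statistics import NormalDist

random.seed(0)  # make the simulation reproducible

# Hypothetical null distribution and observed result
null_mean, null_sd, n = 180.0, 40.0, 25
observed_mean = 160.0
observed_gap = abs(observed_mean - null_mean)


def simulate_sample_mean():
    """Mean of one sample of size n drawn from the null distribution."""
    return sum(random.gauss(null_mean, null_sd) for _ in range(n)) / n


# Fraction of simulated sample means at least as extreme as the observed one
n_sims = 5000
extreme = sum(
    abs(simulate_sample_mean() - null_mean) >= observed_gap
    for _ in range(n_sims)
)
p_simulated = extreme / n_sims

# Same quantity from the normal curve, for comparison
standard_error = null_sd / n ** 0.5
z = (observed_mean - null_mean) / standard_error
p_analytic = 2 * NormalDist().cdf(-abs(z))

print("simulated p-value:", p_simulated)
print("analytic p-value :", round(p_analytic, 4))
```

Both numbers estimate the same tail area, so they should be close; the simulated one simply counts how rare the observed result is when the null is true.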

## Well....

 ***
 <center><img src="../images/pval_meme.png" alt="Drawing" style="width: 350px;"/></center>

## Coming back to John 
***
 - Let's see how John did now
 - Are house prices in OldTown really different from the House Prices of Brooklyn? 
 
 

In [2]:
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv("../train.csv")
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
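from statsmodels.stats.weightstats import ztest

# Sample: sale prices of houses in the OldTown neighborhood
oldtown_prices = data[data['Neighborhood'] == 'OldTown']['SalePrice']
# One-sample z-test against the mean sale price of all houses
z_statistic, p_value = ztest(x1=oldtown_prices, value=data['SalePrice'].mean())
print('Z-statistic is :{}'.format(z_statistic))
print('P-value is :{}'.format(p_value))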

Z-statistic is :-10.639294263334575
P-value is :1.9560526026260018e-26


## Summary of the p-value
***
* When performing a hypothesis test, the p-value is the probability of the observed or a more extreme outcome, given that the null hypothesis is true.

* We see that the p-value is close to zero, i.e., it would be extremely unlikely to see house prices like those in OldTown if their true mean were the same as the mean of all house prices.
* So what can we infer from the p-value of our test? What should be the p-value below which we reject the null hypothesis?
* The p-value below which we reject our null hypothesis depends on our **significance level** $\alpha$
* For a 5% significance level (95% confidence level) we reject our null hypothesis if the p-value is below 0.05
* In this case we can reject the null hypothesis at the 5% significance level.
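
The decision rule can also be written in terms of the test statistic itself. The short sketch below assumes a two-sided test and reuses the z-statistic printed by the z-test above; the variable names are illustrative only.

```python
from statistics import NormalDist

alpha = 0.05                       # significance level (95% confidence level)
z_observed = -10.639294263334575   # z-statistic printed by the z-test above

# Two-sided critical value: |z| must exceed this for the null to be rejected
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Two-sided p-value from the standard normal curve
p_value = 2 * NormalDist().cdf(-abs(z_observed))

reject_by_statistic = abs(z_observed) > z_critical
reject_by_p_value = p_value < alpha

print("critical value:", round(z_critical, 3))
print("reject null (statistic vs critical value):", reject_by_statistic)
print("reject null (p-value vs alpha):", reject_by_p_value)
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same rule, so they always agree.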

## Another way to test: Student's t-test
***
* The t-test is a statistical test used to determine whether a numeric data sample differs significantly from the population, or whether two samples differ from one another.
* A z-test relies on a large sample (a common rule of thumb is more than 30 observations) or a known population standard deviation, but what if our sample is smaller than 30?
* A t-test solves this problem and gives us a way to do a hypothesis test on a smaller sample.
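
To see why the sample size matters, it can help to compare the cut-off values the two tests use. The sketch below assumes a two-sided test at alpha = 0.05 and uses `scipy.stats`; the sample sizes in the loop are arbitrary choices for illustration.

```python
from scipy import stats

alpha = 0.05

# Critical value from the standard normal distribution (z-test)
z_crit = stats.norm.ppf(1 - alpha / 2)

# Critical values from the t-distribution for several sample sizes
t_crits = {}
for n in (5, 25, 100, 1000):
    t_crits[n] = stats.t.ppf(1 - alpha / 2, df=n - 1)  # df = n - 1

print("z critical value:", round(z_crit, 3))
for n, t_crit in t_crits.items():
    print("n =", n, "-> t critical value:", round(t_crit, 3))
```

For small samples the t-distribution has heavier tails, so it demands a larger statistic before rejecting the null; as the sample grows, the t cut-off approaches the z cut-off.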

## Oh John
***
- Now, John also wants to see if house prices in the `Stone Brook` neighborhood are different from those of the rest of the houses in Brooklyn

In [5]:
print('No of houses in Stone Brook: {}'\
      .format(data['Neighborhood'].value_counts()['StoneBr']))

No of houses in Stone Brook: 25


* Let's do a t-test to test our hypothesis

In [8]:
from scipy import stats
stats.ttest_1samp(a= data[data['Neighborhood'] == 'StoneBr']['SalePrice'],               # Sample data
                 popmean= data['SalePrice'].mean())  # Pop mean

Ttest_1sampResult(statistic=5.735070151700397, pvalue=6.558704101036394e-06)

* The p-value in this case again is low and we can reject our null hypothesis
***
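For readers who want to see what a one-sample t-test computes, here is a minimal sketch that derives the t-statistic and p-value by hand and checks them against `scipy.stats.ttest_1samp`. The sample is randomly generated and purely hypothetical (25 prices around 310,000), as is the population mean of 180,000; neither comes from John's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical small sample of 25 sale prices and a hypothetical population mean
sample = rng.normal(loc=310_000, scale=110_000, size=25)
pop_mean = 180_000

n = len(sample)
# Standard error uses the sample standard deviation (ddof=1)
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - pop_mean) / standard_error

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test via scipy
result = stats.ttest_1samp(sample, popmean=pop_mean)

print("manual:", t_manual, p_manual)
print("scipy :", result.statistic, result.pvalue)
```

The only difference from the z-test is the reference distribution: the statistic has the same form, but the p-value comes from the t-distribution with n - 1 degrees of freedom.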

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
### Type I and Type II Error
***
* If we again think of hypothesis test as a criminal trial then it makes sense to frame the verdict in terms of null and alternate hypothesis:

[Trial](images/jury.png)
    * Null Hypothesis: Defendant is innocent
    * Alternate Hypothesis: Defendant is guilty


***
* What type of error is being committed in the following circumstances?
    * Declaring the defendant guilty when they are actually innocent?
    * Declaring the defendant innocent when they are actually guilty?
    
* The first one is a type I error, also known as a "false positive" or "false hit".
* The type I error rate is equal to the significance level α, so setting a higher confidence level (and therefore lower alpha) reduces the chances of getting a false positive.

* The second one is a type II error, also known as a "false negative" or "miss". All else being equal, the higher your confidence level, the more likely you are to make a type II error.

***
<center><img src="../images/hyp2.png" alt="Drawing" style="width: 600px;"/></center>

 ***
 <center><img src="../images/hyp3.png" alt="Drawing" style="width: 600px;"/></center>

## Type 1 Error 
***
Type I error describes a situation where you reject the null hypothesis when it is actually true. 

This type of error is also known as a false positive or false hit.

The type 1 error rate is equal to the significance level α, so setting a higher confidence level (and therefore lower alpha) reduces the chances of getting a false positive.
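The claim that the type I error rate equals α can be checked with a small simulation. The sketch below is an illustration under simple assumptions: every sample is drawn from a standard normal population, so the null hypothesis (mean = 0) is true by construction, and the population standard deviation is treated as known. The sample size and number of trials are arbitrary.

```python
import random
from statistics import NormalDist

random.seed(42)  # make the simulation reproducible

alpha = 0.05
n, n_trials = 50, 4000
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

false_positives = 0
for _ in range(n_trials):
    # The null is TRUE here: the data really come from a mean-0, sd-1 population
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    sample_mean = sum(sample) / n
    # z-statistic with known population standard deviation of 1
    z = sample_mean / (1.0 / n ** 0.5)
    if abs(z) > z_critical:
        false_positives += 1  # rejected a true null: a type I error

false_positive_rate = false_positives / n_trials
print("alpha:", alpha)
print("observed type I error rate:", false_positive_rate)
```

Even though nothing interesting is going on in any of these samples, roughly a fraction α of them still produce a rejection.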

## Type 2 error
***
Type II error describes a situation where you fail to reject the null hypothesis when it is actually false. 

Type II error is also known as a false negative or miss. The higher your confidence level, the more likely you are to make a type II error.
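The trade-off between the two error types can be made concrete with a short calculation. The sketch below assumes a two-sided z-test and a hypothetical true effect that shifts the test statistic by 2 standard errors; `error_rates` is an illustrative helper, and the effect size is an assumption chosen only to show the pattern.

```python
from statistics import NormalDist

std_normal = NormalDist()
effect = 2.0  # hypothetical true shift of the test statistic, in standard errors


def error_rates(alpha):
    """Return (type I rate, type II rate) for a two-sided z-test."""
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    type_1 = alpha
    # Type II: the statistic (centered at `effect`) stays inside the
    # non-rejection region (-z_critical, z_critical)
    type_2 = std_normal.cdf(z_critical - effect) - std_normal.cdf(-z_critical - effect)
    return type_1, type_2


rates = {alpha: error_rates(alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, (type_1, type_2) in rates.items():
    print("alpha =", alpha, "| type I:", round(type_1, 3), "| type II:", round(type_2, 3))
```

With the effect size and sample held fixed, lowering alpha (a higher confidence level) shrinks the type I rate but raises the type II rate.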

## Chi-Squared Goodness-Of-Fit Test
***
A chi-squared goodness-of-fit test checks whether the distribution of sample categorical data matches an expected distribution. 

* For example, you could use a chi-squared goodness-of-fit test to check whether the race demographics of members at your church or school match those of the entire U.S. population or whether the computer browser preferences of your friends match those of Internet users as a whole.

When working with categorical data, the values of the observations themselves aren't of much use for statistical testing because categories like "male", "female," and "other" have no mathematical meaning. 

***
Tests dealing with categorical variables are based on variable counts instead of the actual value of the variables themselves.

Let's generate some fake demographic data for U.S. and Minnesota and walk through the chi-square goodness of fit test to check whether they are different:

In [9]:
national = pd.DataFrame(["white"]*100000 + ["hispanic"]*60000 +\
                        ["black"]*50000 + ["asian"]*15000 + ["other"]*35000)          

minnesota = pd.DataFrame(["white"]*600 + ["hispanic"]*300 + \
                         ["black"]*250 +["asian"]*75 + ["other"]*150)

national_table = pd.crosstab(index=national[0], columns="count")
minnesota_table = pd.crosstab(index=minnesota[0], columns="count")

print( "National")
print(national_table)
print(" ")
print( "Minnesota")
print(minnesota_table)

National
col_0      count
0               
asian      15000
black      50000
hispanic   60000
other      35000
white     100000
 
Minnesota
col_0     count
0              
asian        75
black       250
hispanic    300
other       150
white       600


Chi-squared tests are based on the so-called chi-squared statistic. You calculate the chi-squared statistic with the following formula:

>$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$


In the formula, observed is the actual observed count for each category and expected is the expected count based on the distribution of the population for the corresponding category. 

Let's calculate the chi-squared statistic for our data to illustrate:

In [10]:
observed = minnesota_table

national_ratios = national_table/len(national)  # Get population ratios

expected = national_ratios * len(minnesota)   # Get expected counts

chi_squared_stat = (((observed-expected)**2)/expected).sum()

print(chi_squared_stat)

col_0
count    18.194805
dtype: float64


**Note:** The chi-squared test assumes none of the expected counts are less than 5.
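A quick check of this assumption might look like the sketch below. It rebuilds the expected counts from the national and Minnesota totals used above with plain Python; the threshold of 5 follows the note, and the variable names are illustrative.

```python
# National counts per group, as generated above
national_counts = {
    "asian": 15000,
    "black": 50000,
    "hispanic": 60000,
    "other": 35000,
    "white": 100000,
}
sample_size = 600 + 300 + 250 + 75 + 150  # size of the Minnesota sample

# Expected count per group = national share of the group * sample size
national_total = sum(national_counts.values())
expected_counts = {
    group: sample_size * count / national_total
    for group, count in national_counts.items()
}

# Groups that would violate the "expected count of at least 5" rule
too_small = {g: e for g, e in expected_counts.items() if e < 5}
assumption_holds = not too_small

for group, value in expected_counts.items():
    print(group, round(value, 2))
print("all expected counts >= 5:", assumption_holds)
```

If any group fell below the threshold, a common remedy is to merge it with a neighboring category before running the test.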

Similar to the t-test, where the t statistic is compared to a critical value based on the t-distribution to determine whether the result is significant, in the chi-squared test we compare the chi-squared statistic to a critical value based on the chi-squared distribution. 

The scipy library shorthand for the chi-square distribution is chi2. 

Let's use this knowledge to find the critical value for 95% confidence level and check the p-value of our result:

In [11]:
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence
                      df = 4)   # Df = number of variable categories - 1

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=4)
print("P value")
print(p_value)

Critical value
9.487729036781154
P value
[0.00113047]


**Note:** we are only interested in the right tail of the chi-square distribution. Read more on this [here](https://en.wikipedia.org/wiki/Chi-squared_distribution).

Since our chi-squared statistic exceeds the critical value, we'd reject the null hypothesis that the two distributions are the same.

You can carry out a chi-squared goodness-of-fit test automatically using the scipy function scipy.stats.chisquare():

In [12]:
stats.chisquare(f_obs= observed,   # Array of observed counts
                f_exp= expected)   # Array of expected counts

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))

The test results agree with the values we calculated above.

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

### Chi-Squared Test of Independence
***
Independence is a key concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another. 

For instance, the month you were born probably doesn't tell you anything about which web browser you use, so we'd expect birth month and browser preference to be independent. 

On the other hand, your month of birth might be related to whether you excelled at sports in school, so month of birth and sports performance might not be independent.

The chi-squared test of independence tests whether two categorical variables are independent. 
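Under independence, the expected count in each cell is (row total × column total) / grand total, and the statistic is the same sum of (observed − expected)² / expected used in the goodness-of-fit test. The sketch below works this out by hand for a small, entirely hypothetical 2×3 table and compares the result with `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = two groups, columns = three categories
observed = np.array([[30, 45, 25],
                     [20, 30, 50]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Expected counts if the row and column variables were independent
expected = row_totals * col_totals / grand_total

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, df=dof)  # right-tail probability

# Same test via scipy
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)

print("manual:", chi2_manual, p_manual, dof)
print("scipy :", chi2_scipy, p_scipy, dof_scipy)
```

The degrees of freedom are (rows − 1) × (columns − 1), which is why a larger table needs a larger statistic to reach the same p-value.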



## John's final experiments
***
John wants to test whether knowing `LandContour`, which describes the overall flatness of the property, tells him anything about the price

 -  He has divided the `SalePrice` into three buckets - High, medium, low
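
Bucketing a numeric column into equal-sized groups can be done with `pandas.qcut`. The sketch below uses six made-up prices to show how the labels are attached; note that `qcut` assigns labels starting from the lowest-valued bin, so they should be listed from low to high.

```python
import pandas as pd

# Hypothetical prices, already sorted to make the result easy to read
prices = pd.Series([100, 150, 200, 250, 300, 350])

# Three quantile-based buckets; labels are applied from the lowest bin upward
buckets = pd.qcut(prices, 3, labels=["Low", "Medium", "High"])

print(pd.DataFrame({"price": prices, "bucket": buckets}))
```

Each bucket holds roughly a third of the observations, which is what makes the frequency table comparable across buckets.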

In [13]:
import scipy.stats as sp
def compute_freq_chi2(x,y):
    """Compute the frequency (contingency) table of x and y and run a
    chi-squared test of independence on it.
    
    Parameters:
    -------
    x,y : Pandas Series, must be the same length
    
    Return:
    -------
    None. But prints out frequency table, chi2 test statistic, and 
    p-value
    """
    freqtab = pd.crosstab(x,y)
    print("Frequency table")
    print("============================")
    print(freqtab)
    print("============================")
    chi2,pval,dof,expected = sp.chi2_contingency(freqtab)
    print("ChiSquare test statistic: ",chi2)
    print("p-value: ",pval)
    return

In [14]:
price = pd.qcut(data['SalePrice'], 3, labels = ['Low', 'Medium', 'High'])  # qcut assigns labels from the lowest bin up
compute_freq_chi2(data.LandContour, price)

Frequency table
SalePrice    Low  Medium  High
LandContour                   
Bnk           32      20    11
HLS           10      12    28
Low            8      11    17
Lvl          437     447   427
ChiSquare test statistic:  26.252544346201447
p-value:  0.00019976918050008285


* The low p-value tells us that the two variables aren't independent and knowing the `LandContour` of a house does tell us something about its `SalePrice`.
* The frequency distribution reflects this.
* Houses that are Near Flat/Level (Lvl) are spread almost evenly across the three `SalePrice` buckets.
* On the other hand, houses on a Hillside, i.e., with a significant slope from side to side (HLS), have almost three times as many houses in the high price bucket as in the low one.

<img src="../images/icon/Recap.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
<br /> 

# Recap Time
***
* Hypothesis Testing
* Type 1 & Type 2 error
* Chi-Squared Test

# Thank You
***
For more queries - Reach out to vikash.sharma@msci.com 