## Hypothesis Testing

Hypothesis testing is an important part of statistics and data analysis. It is usually not practical to collect data from an entire population, so we take a sample and use it to make estimates or claims about the population. Such a claim is a hypothesis. Hypothesis testing is the process of checking whether the sample provides enough evidence to reject that hypothesis.

https://towardsdatascience.com/a-complete-guide-to-hypothesis-testing-for-data-scientists-using-python-69f670e6779e

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats.distributions as dist

In [2]:
df = pd.read_csv('./data/heart2.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


The last column of the dataset is ‘target’. It indicates whether the person has heart disease (1) or not (0). The research question for this section is,

#### “The population proportion of Ireland having heart disease is 49%. Are more people suffering from heart disease in the US?”

#### Step 1 
Define the null hypothesis and alternative hypothesis.
In this problem, the null hypothesis is that the population proportion having heart disease in the US is less than or equal to 49%. Testing against the boundary value of exactly 49% is sufficient, because it is the case within the null hypothesis that is hardest to reject: if we can reject "equal to 49%", we can also reject any value below it. So, I am stating the null hypothesis as equal to 49%.
The alternative hypothesis is that the population proportion of the US having heart disease is more than 49%.

In [5]:
#H0: p = 0.49  #null hypothesis H0
#Ha: p > 0.49  #alternative hypothesis Ha

#### Step 2
Assume that the dataset above is a representative sample from the population of the US. So, calculate the sample proportion having heart disease, which is the best estimate of the US population proportion.

In [6]:
# Sample proportion of rows with heart disease (target == 1)
p_us = len(df[df['target']==1])/len(df)
p_us

0.5131707317073171

The sample proportion having heart disease is about 0.513, or 51.3%. This is higher than the hypothesized value of 0.49 (49%) used in the test.
But the question is whether it is significantly higher than 49%. If we took a different simple random sample, the observed sample proportion (51.3%) could be different.
To find out whether the observed sample proportion is significantly higher than the hypothesized value, perform a hypothesis test.
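
To make the point about sampling variability concrete, here is a minimal simulation sketch; it is not part of the original analysis. It assumes a hypothetical population whose true proportion is 0.49 and a sample size of 1025. That size is an assumption, chosen because it is consistent with the proportion printed above (the notebook does not show the row count). The names `simulate_sample_proportions`, `p_true` and `n_assumed` are illustrative.

```python
import math
import random
import statistics

def simulate_sample_proportions(p_true, n, n_samples, seed=0):
    """Draw n_samples simple random samples of size n from a population
    with true proportion p_true and return each sample's proportion."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_samples):
        hits = sum(1 for _ in range(n) if rng.random() < p_true)
        proportions.append(hits / n)
    return proportions

p_true = 0.49      # assumed population proportion (hypothetical)
n_assumed = 1025   # assumed sample size (not shown in the notebook)

props = simulate_sample_proportions(p_true, n_assumed, n_samples=1000)

# Spread of the simulated proportions vs. sqrt(p * (1 - p) / n)
empirical_sd = statistics.stdev(props)
theoretical_se = math.sqrt(p_true * (1 - p_true) / n_assumed)
print(f"min={min(props):.3f}, max={max(props):.3f}")
print(f"empirical sd={empirical_sd:.4f}, theoretical se={theoretical_se:.4f}")
```

Under these assumptions, the sample proportion moves by a percentage point or two from one sample to the next, so a formal test is needed before reading much into a small gap.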

#### Step 3
Calculate the Test Statistic:


test_statistics = (Best_Estimate - Hypothesized_Estimate)/Standard_error_of_estimate

SE = sqrt((p0*(1-p0))/n)

In this formula, p0 is 0.49 (the proportion under the null hypothesis) and n is the sample size.
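
The two formulas above can be wrapped in one small function. This is a sketch using only the standard library, and `one_sample_proportion_z` is a hypothetical helper name. The counts used here (526 positives out of 1025 rows) are an assumption chosen to match the sample proportion printed in Step 2; the notebook does not show them directly.

```python
import math

def one_sample_proportion_z(count, n, p0):
    """z statistic for H0: p = p0, using the standard error under H0."""
    p_hat = count / n                    # best estimate (sample proportion)
    se = math.sqrt(p0 * (1 - p0) / n)    # standard error under the null
    return (p_hat - p0) / se

# Assumed counts (hypothetical): 526 positives out of 1025 rows,
# consistent with the sample proportion 0.5131707... from Step 2.
z = one_sample_proportion_z(count=526, n=1025, p0=0.49)
print(z)
```

Note that the standard error uses the hypothesized proportion p0, not the sample proportion, because the test statistic is evaluated under the assumption that the null hypothesis is true.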

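In [18]:
# Standard error under the null hypothesis (p0 = 0.49)
p0 = 0.49
se = np.sqrt(p0 * (1 - p0) / len(df))

In [19]:
be = p_us  # best estimate: the sample proportion
he = p0    # hypothesized estimate: the proportion under H0
test_stat = (be - he)/se
test_stat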

1.483947557138075

#### Step 4
    
Calculate the p-value


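This test statistic is also called the z-score. You can look up the p-value in a z-table or compute it in Python. The expression below gives the two-sided p-value. Because the alternative hypothesis here is one-sided (p > p0) and the test statistic is positive, the one-sided p-value is half of this value.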

In [20]:
pvalue = 2*dist.norm.cdf(-np.abs(test_stat))
pvalue

0.13782283387121433
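
As a cross-check, the sketch below computes both the two-sided p-value and the upper-tail (one-sided) p-value from the test statistic printed in Step 3. It uses `statistics.NormalDist` from the standard library instead of `scipy`, purely to keep the example dependency-free; `z_test_p_values` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """Return (two-sided, upper-tail) p-values for a z statistic."""
    std_normal = NormalDist()                # mean 0, standard deviation 1
    two_sided = 2 * std_normal.cdf(-abs(z))  # same formula as the cell above
    upper_tail = std_normal.cdf(-z)          # P(Z >= z), for Ha: p > p0
    return two_sided, upper_tail

z = 1.483947557138075  # test statistic printed in Step 3
two_sided, upper_tail = z_test_p_values(z)
print(two_sided, upper_tail)
```

For a positive test statistic, the upper-tail value is exactly half of the two-sided value, which is the p-value that matches the one-sided alternative Ha: p > p0.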

#### Step 5 
Infer the conclusion from the p-value


Consider the significance level alpha to be 5% or 0.05. This means we accept a 5% chance of rejecting the null hypothesis when it is actually true.
Here the p-value (0.138 as computed above; halving it for the one-sided alternative gives about 0.069) is bigger than our significance level of 0.05. So, we cannot reject the null hypothesis. That means the sample does not provide significant evidence that the proportion having heart disease in the US is higher than the hypothesized 49%.
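
The same decision can be reached by comparing the test statistic with a critical value instead of comparing the p-value with alpha. The sketch below assumes an upper-tailed test at alpha = 0.05 and reuses the test statistic printed in Step 3; it relies only on the standard library.

```python
from statistics import NormalDist

alpha = 0.05
z = 1.483947557138075  # test statistic printed in Step 3

# Critical value for an upper-tailed test: the z value that leaves
# probability alpha in the right tail of the standard normal.
z_crit = NormalDist().inv_cdf(1 - alpha)

reject_null = z > z_crit
print(f"z={z:.3f}, critical value={z_crit:.3f}, reject H0: {reject_null}")
```

Because the test statistic falls below the critical value, this approach agrees with the p-value comparison: the null hypothesis is not rejected.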

### Hypothesis Tests for the Difference in Two Proportions