# 5-5 Follow-along-guide Conduct a Hypothesis Test

## Import packages and libraries

Before you begin the activity, import the required libraries. Throughout the course, you will use pandas for data manipulation and `scipy.stats` for statistical operations.

In [1]:
import pandas as pd
from scipy import stats

In [2]:
education_districtwise = pd.read_csv("education_districtwise.csv")
education_districtwise = education_districtwise.dropna()
education_districtwise.head()

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
0,DISTRICT32,STATE1,13,391,104,875564.0,66.92
1,DISTRICT649,STATE1,18,678,144,1015503.0,66.93
2,DISTRICT229,STATE1,8,94,65,1269751.0,71.21
3,DISTRICT259,STATE1,13,523,104,735753.0,57.98
4,DISTRICT486,STATE1,8,359,64,570060.0,65.0


In [3]:
education_districtwise.size

4438

## Activity overview

This activity continues the scenario from an earlier part of the course, in which you are a data professional working for the Department of Education of a large nation. Recall that you are analyzing data on the literacy rate for each district.

Now imagine that the Department of Education asks you to collect data on mean district literacy rates for two of the nation’s largest states: STATE21 and STATE28. STATE28 has almost 40 districts, and STATE21 has more than 70. Due to limited time and resources, you are only able to survey 20 randomly chosen districts in each state. The department asks you to determine whether the difference between the two mean district literacy rates is statistically significant or due to chance. This will help the department decide how to distribute government funding to improve literacy. If there is a statistically significant difference, the state with the lower literacy rate may receive more funding.

In this activity, you will use Python to simulate taking a random sample of 20 districts in each state and conduct a two-sample t-test based on the sample data. 


## Explore the data

To start, filter your dataframe for the district literacy rate data from the states STATE21 and STATE28. 

First, name a new variable: `state21`. Then, use the relational operator for equals (`==`) to get the relevant data from the `STATNAME` column. 

In [4]:
print(education_districtwise['STATNAME'].value_counts())

STATNAME
STATE21    71
STATE22    50
STATE28    38
STATE17    35
STATE13    33
STATE6     30
STATE20    30
STATE33    27
STATE24    27
STATE9     26
STATE23    24
STATE1     22
STATE25    21
STATE34    20
STATE26    20
STATE3     16
STATE31    16
STATE5     14
STATE15    13
STATE7     13
STATE16    12
STATE27    11
STATE29    10
STATE35     9
STATE2      9
STATE12     8
STATE4      7
STATE14     4
STATE18     4
STATE32     4
STATE11     3
STATE10     2
STATE30     2
STATE36     1
STATE19     1
STATE8      1
Name: count, dtype: int64


In [5]:
state21 = education_districtwise[education_districtwise['STATNAME'] == "STATE21"]
state21.head()

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
133,DISTRICT607,STATE21,14,1357,127,3464228.0,72.03
134,DISTRICT50,STATE21,12,594,86,4138605.0,70.11
135,DISTRICT61,STATE21,16,1919,159,3683896.0,70.43
136,DISTRICT191,STATE21,10,1141,69,4773138.0,58.67
137,DISTRICT328,STATE21,7,1116,85,2335398.0,55.08


In [6]:
state21.info()

<class 'pandas.core.frame.DataFrame'>
Index: 71 entries, 133 to 204
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   DISTNAME    71 non-null     object 
 1   STATNAME    71 non-null     object 
 2   BLOCKS      71 non-null     int64  
 3   VILLAGES    71 non-null     int64  
 4   CLUSTERS    71 non-null     int64  
 5   TOTPOPULAT  71 non-null     float64
 6   OVERALL_LI  71 non-null     float64
dtypes: float64(2), int64(3), object(2)
memory usage: 4.4+ KB


Next, name another variable: `state28`. Follow the same procedure to get the relevant data from the `STATNAME` column. 

In [7]:
state28 = education_districtwise[education_districtwise['STATNAME'] == "STATE28"]
state28.head()

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
208,DISTRICT495,STATE28,18,1210,193,3922780.0,58.06
209,DISTRICT208,STATE28,27,1534,251,5082868.0,58.26
210,DISTRICT618,STATE28,5,183,34,656916.0,56.0
211,DISTRICT554,STATE28,17,852,169,3419622.0,53.53
212,DISTRICT642,STATE28,21,1102,241,4476044.0,60.9


In [8]:
state28.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38 entries, 208 to 245
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   DISTNAME    38 non-null     object 
 1   STATNAME    38 non-null     object 
 2   BLOCKS      38 non-null     int64  
 3   VILLAGES    38 non-null     int64  
 4   CLUSTERS    38 non-null     int64  
 5   TOTPOPULAT  38 non-null     float64
 6   OVERALL_LI  38 non-null     float64
dtypes: float64(2), int64(3), object(2)
memory usage: 2.4+ KB


### Simulate random sampling

Now that you have organized your data, use the dataframe's `sample()` method to take a random sample of 20 districts from each state. First, name a new variable: `sampled_state21`. Then, enter the arguments of `sample()`.

*   `n`: Your sample size is `20`. 
*   `replace`: Choose `True` because you are sampling with replacement.
*   `random_state`: Choose an arbitrary number for the random seed – how about `5200`. 
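To see what `replace=True` and `random_state` do in practice, here is a minimal sketch on a made-up five-row dataframe. The `toy` data and its values are hypothetical and are not taken from `education_districtwise.csv`. Because 20 draws are taken from only 5 rows, some rows must repeat, and reusing the same seed reproduces the same sample.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "DISTNAME": ["A", "B", "C", "D", "E"],
    "OVERALL_LI": [55.0, 60.0, 65.0, 70.0, 75.0],
})

# Sampling with replacement: the same row can be drawn more than once.
first_draw = toy.sample(n=20, replace=True, random_state=5200)

# Reusing the seed yields the same rows in the same order.
second_draw = toy.sample(n=20, replace=True, random_state=5200)

has_repeats = first_draw.index.duplicated().any()
is_reproducible = first_draw.equals(second_draw)
print("Repeated rows present:", has_repeats)
print("Same sample with same seed:", is_reproducible)
```

With `replace=False`, pandas would raise an error here, because you cannot draw 20 distinct rows from a dataframe that has only 5.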

In [15]:
sampled_state21 = state21.sample(n=20, replace = True, random_state=5200)
sampled_state21.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 171 to 202
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   DISTNAME    20 non-null     object 
 1   STATNAME    20 non-null     object 
 2   BLOCKS      20 non-null     int64  
 3   VILLAGES    20 non-null     int64  
 4   CLUSTERS    20 non-null     int64  
 5   TOTPOPULAT  20 non-null     float64
 6   OVERALL_LI  20 non-null     float64
dtypes: float64(2), int64(3), object(2)
memory usage: 1.2+ KB


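Now, name another variable: `sampled_state28`. Follow the same procedure. You could choose a different number for the random seed, but this notebook reuses `5200`. Because the two states contain different rows, the two samples are still drawn from different data.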

In [16]:
sampled_state28 = state28.sample(n=20, replace = True, random_state=5200)
sampled_state28.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 242 to 237
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   DISTNAME    20 non-null     object 
 1   STATNAME    20 non-null     object 
 2   BLOCKS      20 non-null     int64  
 3   VILLAGES    20 non-null     int64  
 4   CLUSTERS    20 non-null     int64  
 5   TOTPOPULAT  20 non-null     float64
 6   OVERALL_LI  20 non-null     float64
dtypes: float64(2), int64(3), object(2)
memory usage: 1.2+ KB


### Compute the sample means

You now have two random samples of 20 districts—one sample for each state. Next, use `mean()` to compute the mean district literacy rate for both STATE21 and STATE28.

In [17]:
sampled_state21['OVERALL_LI'].mean()

np.float64(66.1445)

In [18]:
sampled_state28['OVERALL_LI'].mean()

np.float64(64.85799999999999)

STATE21 has a mean district literacy rate of about 66.1%, while STATE28 has a mean district literacy rate of about 64.9%.

Based on your sample data, the observed difference between the mean district literacy rates of STATE21 and STATE28 is about 1.3 percentage points (66.14% - 64.86%).

**Note**: At this point, you might be tempted to conclude that STATE21 has a higher overall literacy rate than STATE28. However, due to sampling variability, this observed difference might simply be due to chance, rather than an actual difference in the corresponding population means. A hypothesis test can help you determine whether or not your results are statistically significant. 
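The effect of sampling variability can be illustrated with a short simulation. The sketch below uses two hypothetical populations (`pop_a` and `pop_b`) generated from the same distribution, so any difference in their sample means comes from chance alone. The data and seeds are assumptions made for illustration and are not part of the district dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical populations generated from the SAME distribution, for illustration only.
rng = np.random.default_rng(0)
pop_a = pd.Series(rng.normal(loc=65, scale=8, size=70))
pop_b = pd.Series(rng.normal(loc=65, scale=8, size=40))

# Draw several pairs of samples, changing the seeds each time.
differences = []
for seed in range(5):
    mean_a = pop_a.sample(n=20, replace=True, random_state=seed).mean()
    mean_b = pop_b.sample(n=20, replace=True, random_state=seed + 100).mean()
    differences.append(mean_a - mean_b)

for seed, diff in enumerate(differences):
    print(f"seed {seed}: difference in sample means = {diff:.2f}")
```

Each pair of samples gives a different observed difference, which is why a single observed difference is not enough to draw a conclusion about the populations.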

### Conduct a hypothesis test

Now that you’ve organized your data and simulated random sampling, you’re ready to conduct your hypothesis test. Recall that a two-sample t-test is the standard approach for comparing the means of two independent samples. To review, the steps for conducting a hypothesis test are:

1.   State the null hypothesis and the alternative hypothesis.
2.   Choose a significance level.
3.   Find the p-value. 
4.   Reject or fail to reject the null hypothesis.

#### Step 1: State the null hypothesis and the alternative hypothesis

The **null hypothesis** is a statement that is assumed to be true unless there is convincing evidence to the contrary. The **alternative hypothesis** is a statement that contradicts the null hypothesis and is accepted as true only if there is convincing evidence for it. 

In a two-sample t-test, the null hypothesis states that there is no difference between the means of your two groups. The alternative hypothesis states the contrary claim: there is a difference between the means of your two groups. 

We use $H_0$ to denote the null hypothesis and $H_A$ to denote the alternative hypothesis.

*   $H_0$: There is no difference in the mean district literacy rates between STATE21 and STATE28.
*   $H_A$: There is a difference in the mean district literacy rates between STATE21 and STATE28.



#### Step 2: Choose a significance level

The **significance level** is the threshold at which you will consider a result statistically significant. This is the probability of rejecting the null hypothesis when it is true. The Department of Education asks you to use their standard level of 5%, or 0.05.  
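One way to build intuition for the significance level is to simulate many tests in which the null hypothesis is true and count how often it is rejected anyway. The following is a minimal sketch under assumed conditions: normally distributed, made-up data with 20 observations per group. The parameter values are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_trials = 2000

# Both groups come from the same distribution, so the null hypothesis is true.
group_a = rng.normal(loc=65, scale=8, size=(n_trials, 20))
group_b = rng.normal(loc=65, scale=8, size=(n_trials, 20))

# One two-sample t-test per row (axis=1 runs the tests row by row).
result = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)

# Share of tests that reject the null hypothesis even though it is true.
false_positive_rate = np.mean(result.pvalue < alpha)
print(f"Share of tests that reject a true null: {false_positive_rate:.3f}")
```

Under these assumptions, the share of false rejections should land close to the chosen significance level of 0.05.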

#### Step 3: Find the p-value

**P-value** refers to the probability of observing results as or more extreme than those observed when the null hypothesis is true.

Based on your sample data, the difference between the mean district literacy rates of STATE21 and STATE28 is about 1.3 percentage points. Your null hypothesis claims that this difference is due to chance. Your p-value is the probability of observing an absolute difference in sample means that is 1.3 percentage points or greater *if* the null hypothesis is true. If this outcome is very unlikely (in particular, if your p-value is *less than* your significance level of 5%), then you will reject the null hypothesis.

#### `scipy.stats.ttest_ind()`

For a two-sample $t$-test, you can use `scipy.stats.ttest_ind()` to compute your p-value. This function includes the following arguments:

*   `a`: Observations from the first sample 
*   `b`: Observations from the second sample
*   `equal_var`: A boolean, or true/false statement, which indicates whether the population variance of the two samples is assumed to be equal. In our example, you don’t have access to data for the entire population, so you don’t want to assume anything about the variance. To avoid making a wrong assumption, set this argument to `False`. 

**Reference:** [scipy.stats.ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html)
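With `equal_var=False`, `ttest_ind()` performs Welch's t-test, which does not pool the two sample variances. As a rough sketch of what happens under the hood, the example below computes the t-statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with the SciPy result. The arrays `sample_a` and `sample_b` are made-up values for illustration, not the samples drawn in this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, for illustration only.
sample_a = np.array([72.0, 70.1, 70.4, 58.7, 55.1, 66.3, 61.8, 74.2])
sample_b = np.array([58.1, 58.3, 56.0, 53.5, 60.9, 63.2, 49.7, 57.4])

n_a, n_b = len(sample_a), len(sample_b)

# Sample variance of each group divided by its sample size.
v_a = sample_a.var(ddof=1) / n_a
v_b = sample_b.var(ddof=1) / n_b

# Welch's t-statistic: difference in means over its estimated standard error.
t_manual = (sample_a.mean() - sample_b.mean()) / np.sqrt(v_a + v_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual = (v_a + v_b) ** 2 / (v_a**2 / (n_a - 1) + v_b**2 / (n_b - 1))

# Two-sided p-value from the t-distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
print("t-statistic:", t_manual, "vs", result.statistic)
print("p-value:", p_manual, "vs", result.pvalue)
```

The manual values and the SciPy values should agree up to floating-point error.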


Now write your code and enter the relevant arguments: 

*   `a`: Your first sample refers to the district literacy rate data for STATE21, which is stored in the `OVERALL_LI` column of your variable `sampled_state21`.
*   `b`: Your second sample refers to the district literacy rate data for STATE28, which is stored in the `OVERALL_LI` column of your variable `sampled_state28`.
*   `equal_var`: Set to `False` because you don’t want to assume that the two samples have the same variance.

In [19]:
# Two-sample t-test with n < 30 per sample; equal_var=False does not assume equal variances.

stats.ttest_ind(a=sampled_state21['OVERALL_LI'], b=sampled_state28['OVERALL_LI'], equal_var=False)

TtestResult(statistic=np.float64(0.4648579183454708), pvalue=np.float64(0.645250458938771), df=np.float64(31.30509760355441))

Your p-value is about 0.645, or 64.5%.

This means there is about a 64.5% probability that the absolute difference between the two mean district literacy rates would be 1.3 percentage points or greater if the null hypothesis were true. In other words, a difference of this size could easily be due to chance.
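The p-value reported by `ttest_ind()` can be recovered from the t-statistic and degrees of freedom in the `TtestResult` output above. The sketch below copies those two numbers and computes the two-sided tail probability of the t-distribution. The variable names are chosen here for illustration.

```python
from scipy import stats

# Values copied from the TtestResult output shown above.
t_statistic = 0.4648579183454708
degrees_of_freedom = 31.30509760355441

# Two-sided p-value: probability in both tails beyond |t|.
p_value = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)
print("p-value:", p_value)
```

The result should match the `pvalue` field of the output, about 0.645.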

#### Step 4: Reject or fail to reject the null hypothesis

To draw a conclusion, compare your p-value with the significance level.

*   If the p-value is **less** than the significance level, you can conclude that there is a statistically significant difference in the mean district literacy rates between STATE21 and STATE28. In other words, you will **reject** the null hypothesis $H_0$.
*   If the p-value is **greater** than the significance level, you can conclude that there is *not* a statistically significant difference in the mean district literacy rates between STATE21 and STATE28. In other words, you will **fail to reject** the null hypothesis $H_0$.

Your p-value of 0.645, or 64.5%, is greater than the significance level of 0.05, or 5%. Therefore, you will *fail to reject* the null hypothesis and conclude that there is not a statistically significant difference between the mean district literacy rates of the two states: STATE21 and STATE28.

In [21]:
# Extract the p-value and compare it with the significance level.

statistic, pvalue = stats.ttest_ind(a=sampled_state21['OVERALL_LI'], b=sampled_state28['OVERALL_LI'], equal_var=False)

print("pvalue:", pvalue)
print()

if pvalue < 0.05:
    print('pvalue < 0.05, Reject Ho.')          # Ha: There is a difference in the means
else:
    print('pvalue > 0.05, Fail to reject Ho.')  # Ho: There is no difference in the means

pvalue: 0.645250458938771

pvalue > 0.05, Fail to reject Ho.


    There is not a statistically significant difference between the mean district literacy rates of the two states: 
    STATE21 and STATE28.

Your analysis will help the Department of Education decide how to distribute government resources. Since there is not a statistically significant difference in mean district literacy rates, your sample data does not provide evidence that either state should receive more resources than the other on the basis of its literacy rate.