# OVERVIEW

We will use Python to simulate taking a random sample of 20 districts from each of two states, then conduct a two-sample t-test on the sample data.

In [1]:
import pandas as pd
from scipy import stats

# Load the district-level education data
education_districtwise = pd.read_csv("education_districtwise.csv")
# Drop rows that contain missing values
education_districtwise = education_districtwise.dropna()

## EXPLORE THE DATA

We are going to use STATE21 and STATE28 as our examples. 

In [3]:
state21 = education_districtwise[education_districtwise['STATNAME'] == 'STATE21']
state28 = education_districtwise[education_districtwise['STATNAME'] == 'STATE28']

## SIMULATE RANDOM SAMPLING


Now that you have organized your data, use the `sample()` function to take a random sample of 20 districts from each state. First, name a new variable: `sampled_state21`. Then, enter the arguments of the `sample()` function. Repeat the same steps for `sampled_state28`, using a different random seed.

*   `n`: Your sample size is `20`. 
*   `replace`: Choose `True` because you are sampling with replacement.
*   `random_state`: Choose an arbitrary number for the random seed – how about `13490`. 

In [5]:
sampled_state21 = state21.sample(n = 20,
                                replace = True,
                                random_state = 13490)

sampled_state28 = state28.sample(n = 20,
                                replace = True,
                                random_state = 39103)
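The effect of the `replace` and `random_state` arguments can be checked on a small table. The sketch below is only an illustration: the `toy` DataFrame, its `DISTNAME` column, and its values are made up and are not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical toy data: five made-up districts, not taken from the real file.
toy = pd.DataFrame({
    "DISTNAME": [f"DISTRICT{i}" for i in range(1, 6)],
    "OVERALL_LI": [61.2, 70.5, 58.9, 66.0, 73.4],
})

# The same seed returns the same rows, so the draw is reproducible.
first_draw = toy.sample(n=20, replace=True, random_state=13490)
second_draw = toy.sample(n=20, replace=True, random_state=13490)
same_rows = first_draw.index.equals(second_draw.index)

# With replacement, n can exceed the number of rows, so some rows must repeat.
has_repeats = bool(first_draw.index.duplicated().any())

print(same_rows, has_repeats)  # True True
```

Because rows can repeat when sampling with replacement, the same district may appear more than once in a sample of 20.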

## COMPUTE THE MEANS

In [8]:
sampled_state21['OVERALL_LI'].mean()

70.82900000000001

In [9]:
sampled_state28['OVERALL_LI'].mean()

64.60100000000001

In these samples, STATE21 has a mean district literacy rate of about **70.8%**, while STATE28 has a mean district literacy rate of about **64.6%**.

Based on your sample data, the observed difference between the mean district literacy rates of STATE21 and STATE28 is 6.2 percentage points (70.8% - 64.6%). 

Due to sampling variability, this observed difference might simply be due to chance, rather than an actual difference in the corresponding population means. A hypothesis test can help us determine whether or not this difference is statistically significant.
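To get a feel for how much a sample mean can move from one draw to the next, here is a minimal sketch. It assumes a hypothetical population of 500 district literacy rates generated with NumPy; the numbers are made up and do not come from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 500 district literacy rates (made-up values).
population = rng.normal(loc=70.0, scale=8.0, size=500)

# Draw 1,000 samples of 20 districts with replacement and record each mean.
sample_means = np.array([
    rng.choice(population, size=20, replace=True).mean()
    for _ in range(1000)
])

print(f"population mean: {population.mean():.2f}")
print(f"sample means range from {sample_means.min():.2f} to {sample_means.max():.2f}")
print(f"standard deviation of the sample means: {sample_means.std(ddof=1):.2f}")
```

Every sample comes from the same population, yet the sample means differ. That spread is why an observed gap between two sample means needs a formal test.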

## CONDUCT A HYPOTHESIS TEST

### State the null hypothesis and the alternative hypothesis

- 𝐻0: There is no difference in the mean district literacy rates between STATE21 and STATE28.
- 𝐻𝐴: There is a difference in the mean district literacy rates between STATE21 and STATE28.

### Choose the significance level

- Use the standard significance level of 5% (0.05).

### Find the p-value

**P-value** refers to the probability of observing results as or more extreme than those observed when the null hypothesis is true.

Based on your sample data, the difference between the mean district literacy rates of STATE21 and STATE28 is 6.2 percentage points. Your null hypothesis claims that this difference is due to chance. Your p-value is the probability of observing an absolute difference in sample means of 6.2 percentage points or greater *if* the null hypothesis is true. If this outcome is very unlikely (in particular, if your p-value is *less than* your significance level of 5%), then you will reject the null hypothesis.
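One way to make this definition concrete is a permutation test, sketched below. Note that this is a different technique from the t-test used in this notebook, shown only to illustrate what "as or more extreme if the null hypothesis is true" means. The two groups are hypothetical, made-up samples.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical samples of 20 literacy rates per state (made-up values).
group_a = rng.normal(loc=71.0, scale=7.0, size=20)
group_b = rng.normal(loc=65.0, scale=7.0, size=20)
observed = abs(group_a.mean() - group_b.mean())

# Under the null hypothesis the group labels are interchangeable, so shuffle
# the pooled values and count how often the gap is at least as large.
pooled = np.concatenate([group_a, group_b])
n_permutations = 10_000
count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:20].mean() - shuffled[20:].mean())
    if diff >= observed:
        count += 1

p_value = count / n_permutations
print(f"observed difference: {observed:.2f}, permutation p-value: {p_value:.4f}")
```

The fraction of shuffles that produce a gap at least as large as the observed one plays the role of the p-value.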

In [10]:
stats.ttest_ind(a = sampled_state21['OVERALL_LI'],
               b = sampled_state28['OVERALL_LI'],
               equal_var = False)

Ttest_indResult(statistic=2.8980444277268735, pvalue=0.0064217191427652365)

**p-value = 0.0064 (0.64%)**

Meaning --> there is only a 0.64% probability that the absolute difference between the two mean district literacy rates would be 6.2 percentage points or greater if the null hypothesis were true. That is, a difference this large would be highly unlikely if it were due to chance alone.
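Passing `equal_var = False` makes `ttest_ind` run Welch's t-test, which does not assume the two groups share a variance. As a rough sketch of what that computes, the code below works out the statistic and p-value by hand on hypothetical, made-up samples and compares them with SciPy's result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical samples with different spreads (made-up values).
x = rng.normal(loc=71.0, scale=7.0, size=20)
y = rng.normal(loc=65.0, scale=9.0, size=20)

# Squared standard error of each sample mean.
vx = x.var(ddof=1) / len(x)
vy = y.var(ddof=1) / len(y)

# Welch's t statistic and the Welch-Satterthwaite degrees of freedom.
t_manual = (x.mean() - y.mean()) / np.sqrt(vx + vy)
df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df)

result = stats.ttest_ind(a=x, b=y, equal_var=False)
print(np.isclose(t_manual, result.statistic), np.isclose(p_manual, result.pvalue))
```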

### Reject or fail to reject the null hypothesis

To draw a conclusion, compare your p-value with the significance level:

- if the p-value is less than the significance level, you can conclude that there is a statistically significant difference in the mean district literacy rates between STATE21 and STATE28 --> **Reject the null hypothesis**
- if the p-value is greater than the significance level, you can conclude that there is *not* a statistically significant difference in the mean district literacy rates between STATE21 and STATE28 --> **Fail to reject the null hypothesis**

The p-value of our data is **0.0064**, which is less than the threshold of 0.05. Therefore, we can **reject the null hypothesis** and say that there is a statistically significant difference between the mean district literacy rates of the two states.
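The decision rule can also be written as a short check. This sketch hard-codes the p-value reported by `ttest_ind` above; the variable names `alpha` and `decision` are chosen here for illustration.

```python
alpha = 0.05                      # significance level chosen earlier
p_value = 0.0064217191427652365   # p-value from the ttest_ind output above

if p_value < alpha:
    decision = "Reject the null hypothesis"
else:
    decision = "Fail to reject the null hypothesis"

print(decision)  # Reject the null hypothesis
```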