    
<h1 style="text-align: center; color: purple;" markdown="1">Econ 220 Python Lab - Inference: T-test of Difference in Means</h1>

<h2 style="text-align: center; color: #012169" markdown="1">Handout</h2>


### Table of Contents 
* [Testing for Differences in Means](#anchor1)
    * [Proportion Tables](#anchor3)
    * [Box Plots to See the Distribution of the Data by Groups](#anchor4)
    * [Descriptive Statistics using the Pandas & Numpy Packages](#anchor5)
    * [The `ttest_ind` for Difference in Means](#anchor6)
        * [T-test for Differences in Weight by Gender](#anchor7)
        * [T-test for Differences in Weight by Habit](#anchor8)
* [Testing for Differences in Proportion](#anchor2)
    * [The `proportions_ztest` for Difference in Proportions](#anchor9)
        * [Z-test for Differences in proportion of Gender](#anchor10)


# Testing for Differences in Means <a class = anchor id = anchor1></a>

When two groups have sample means that differ numerically, the question is whether that difference is statistically significant: could a gap this large arise from sampling variation alone, or does it reflect a real difference between the population means? A test cannot prove that a difference exists; it only tells you how compatible the sample is with there being no difference. The test used for a difference in means is the **t-test**, which proceeds as follows:

1. State the null hypothesis $H_0$ and the alternative hypothesis $H_a$ 
$$H_0: \mu_1 = \mu_2 \\ H_a:  \mu_1 \neq \mu_2$$
This is equivalent to: 
$$H_0: \mu_1 - \mu_2=d \\ H_a:  \mu_1-\mu_2\neq d $$
with $d = 0$


2. Set the level of significance at $\alpha$ and find the critical value of $t$ associated with it, $t_\alpha$. 
3. Calculate the t-test statistic 

$t=\frac{(\bar{x}_1-\bar{x}_2) - d}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

where $\bar{x}_i$, $s_i^2$ and $n_i$ are the sample mean, sample variance and size of group $i$.

4. Calculate the $p$-value 
5. Make a decision. Check whether to reject the null hypothesis by comparing the $p$-value to $\alpha$ and/or $|t|$ to $t_\alpha$. If $p$-value $<\alpha$ (equivalently, $|t| > t_\alpha$), reject $H_0: \mu_1 - \mu_2=d$
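
The five steps above can be sketched by hand as follows. This is a minimal illustration on synthetic data (the arrays `group1` and `group2` are made up and are not the birth data used in this lab). It assumes a two-sided test and uses the unpooled standard error from step 3, together with the Welch-Satterthwaite degrees of freedom, which the formula above does not spell out.

```python
import numpy as np
from scipy import stats

# Synthetic data, for illustration only (not the North Carolina births sample)
rng = np.random.default_rng(0)
group1 = rng.normal(loc=7.0, scale=1.5, size=60)
group2 = rng.normal(loc=7.4, scale=1.2, size=45)
d = 0          # hypothesized difference under H_0
alpha = 0.05   # significance level

# Step 3: t statistic with the unpooled standard error
n1, n2 = len(group1), len(group2)
v1, v2 = group1.var(ddof=1) / n1, group2.var(ddof=1) / n2
t_stat = (group1.mean() - group2.mean() - d) / np.sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom (assumption: unequal variances)
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Step 2: two-sided critical value; Step 4: two-sided p-value
t_crit = stats.t.ppf(1 - alpha / 2, df)
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Step 5: decision
reject = p_value < alpha
print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, "
      f"p-value = {p_value:.4f}, reject H0: {reject}")

# Cross-check against scipy; equal_var=False selects the unpooled version
check = stats.ttest_ind(group1, group2, equal_var=False)
```

Note that `ttest_ind` assumes equal variances by default (`equal_var=True`) and then uses a pooled standard error, so its output can differ slightly from the formula in step 3 unless `equal_var=False` is passed.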

You can perform this test in Python step by step, calculating $t$ and comparing it to the critical value, or you can use a function that does it for you. The scipy package has the function `ttest_ind`, which performs this test. Its documentation can be found [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html)


But before running either of these tests, it's worth investigating a few things about your data and your sample. Specifically, you should compute some summary statistics and plot a few graphs to inform your hypothesis. We start by looking at our data and its variables. 

We will be using birth data from North Carolina, which has some information about parents and babies.

In [None]:
# Import necessary packages here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest
# Don't forget to add the path for your own computer 

In [None]:
# Use the path for your computer
BirthdataNC = pd.read_csv()
BirthdataNC.info()

## Let's look at the data <a class = anchor id = anchor2></a>

It is good to see how the groups are represented in your data before testing anything. We will calculate the proportion of female births in this dataset further below. 

First, we are going to look at the weight distribution by gender and habit (smoker vs. nonsmoker). 

In [None]:
#ax = sns.boxplot(x="gender", y="weight", data=BirthdataNC)
# hue_order fixes which gender gets which palette color,
# so the histogram colors always match the mean lines below
ax = sns.histplot(x = BirthdataNC.weight, 
                  hue = BirthdataNC.gender,
                  hue_order = ['male','female'],
                  palette = ['purple','pink'], 
                  kde=True, alpha=0.5)
# Dashed vertical line at the mean weight of each group
plt.axvline(BirthdataNC[BirthdataNC['gender']=='female'].
            weight.mean(), color='pink', 
            linestyle='dashed', linewidth=1)
plt.axvline(BirthdataNC[BirthdataNC['gender']=='male'].
            weight.mean(), color='purple', 
            linestyle='dashed', linewidth=1)
#ax = sns.violinplot(x="gender", y="weight", data=BirthdataNC, palette="Blues")
ax.set_title("Histogram: Birthweight by gender"); # the semicolon here suppresses the output message; try without it. 

In [None]:
# I suggest keeping the warning filter commented out
# and only activating it once you know what the warning is about
# import warnings
# warnings.filterwarnings("ignore")

# Make a dictionary with one specific color per group:
my_pal = {"nonsmoker": "lightgreen", "smoker": "purple"}

ax = sns.boxplot(x="habit", y="weight", 
                 data=BirthdataNC, palette=my_pal)
ax = sns.swarmplot(x="habit", y="weight", 
                   data=BirthdataNC, color="grey", alpha=0.5)


We first looked at visual differences; now let's see the actual values. 

First, let's check the proportion of females and how you can calculate this in different ways. 
Then calculate the mean and standard deviation of weight by gender and habit.

The standard deviation is good to have, but to assess statistical significance we really want the standard error of the mean, which is the standard deviation divided by the square root of the group size.
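
As a quick check of that definition, the sketch below computes the standard error by hand and compares it with the pandas built-in `sem`. The small `toy` DataFrame is hypothetical: its column names mirror the lab's, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male", "male", "male"],
    "weight": [6.8, 7.1, 7.5, 7.0, 7.9, 8.2, 6.5],
})

grouped = toy.groupby("gender")["weight"]

# Standard error of the mean: sample std divided by sqrt(group size)
manual_sem = grouped.std() / np.sqrt(grouped.count())

# pandas offers the same quantity directly
builtin_sem = grouped.sem()

print(pd.DataFrame({"manual": manual_sem, "builtin": builtin_sem}))
```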

In [None]:
female = BirthdataNC[BirthdataNC['gender']=='female']
male = BirthdataNC[BirthdataNC['gender']=='male']
# Calculate the mean, standard deviation, and standard error of the mean for the weight of females and males (by group)

BirthdataNC.pivot_table(values=["weight"], 
                        index="gender", 
                        aggfunc=['mean', 'std', 'sem'])
#BirthdataNC.groupby('gender')['weight'].agg(['mean', 'std', 'sem'])

In [None]:
# Calculate the mean, standard deviation, and standard error of the mean for the weight of babies based on their mothers' habits
BirthdataNC.groupby('habit')['weight'].agg(['mean', 'std', 'sem'])

## The `ttest_ind` for Difference in Means <a class = anchor id = anchor5></a>

Are these differences significant? We need to test for this: given the information in this sample, is the difference in mean weight by gender statistically different from zero?

To test for a difference in means you use a t-test. 
You state the test through a null hypothesis $H_0$ and an alternative hypothesis $H_a$: 


$$H_0 : \mu_f- \mu_m = 0$$
$$H_a : \mu_f- \mu_m  \neq 0$$
where $\mu_f$ and $\mu_m$ are the population mean weights of female and male babies.

The t-test is a statistical test that allows you to check for differences in means between groups. For two independent samples, it is run with: 

`ttest_ind()`

Let's do a t-test for the difference in mean weight, splitting the sample by two different variables: **gender and smoking habits** 


<h4 style="color: #012169" markdown="1">T-test for Differences in Weight by Gender: </h4> <a class = anchor id = anchor6></a>

In [None]:
from scipy.stats import ttest_ind
#from statsmodels.stats.proportion import proportions_ztest
print('\nDifference in means between female vs male\n', 
      _____)


In [None]:

#run independent sample T-Test 
tStat, pValue = stats.ttest_ind(______________ , 
                                _______________)
#print the P-Value and the T-Statistic
print("\nT-test for differences in weight by gender:\n P-Value:{0}\n T-Statistic:{1}\n". 
        format(round(pValue,3),
               round(tStat,3))) 



<h4 style="color: #012169" markdown="1">T-test for Differences in Weight by Habit: </h4> <a class = anchor id = anchor7></a>

In [None]:
# nonsmoke = BirthdataNC[BirthdataNC['habit']=='nonsmoker']['weight']
# smoke = BirthdataNC[BirthdataNC['habit']=='smoker']['weight']
# array = ([nonsmoke, smoke])
# Here is a good place to use a lambda expression
BirthdataNC.groupby('habit')['weight'].apply(__________)

In [None]:


# Select each group mean by label instead of by position
habit_means = BirthdataNC.groupby('habit')['weight'].mean()
print('\nDifference in means between nonsmokers vs smokers\n',
      round(habit_means.loc['nonsmoker'] - habit_means.loc['smoker'], 5))

testhabit=ttest_ind(*_____________)
# the * before the list unpacks it into variadic arguments,
# which in this case means that ttest_ind is going to take 
# the two elements of the list as positional arguments 
#print the P-Value and the T-Statistic
print("\nT-test for differences in weight by habit:\n P-Value: {0}\n T-Statistic: {1}\n".
        format(round(testhabit.pvalue,3),
               round(testhabit.statistic,3)))
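
To make the `*` unpacking concrete, here is a small sketch on two made-up samples (the numbers are illustrative and have nothing to do with the birth data). It shows that passing `*samples` is the same call as passing each element of the list by hand.

```python
from scipy.stats import ttest_ind

# Two made-up samples stored in one list (illustrative values only)
samples = [[5.1, 6.3, 7.2, 6.8, 7.0], [6.9, 7.4, 8.1, 7.7, 7.3]]

# The * spreads the list into separate positional arguments...
unpacked = ttest_ind(*samples)

# ...so it is the same call as passing each element explicitly
explicit = ttest_ind(samples[0], samples[1])

print(unpacked.statistic == explicit.statistic,
      unpacked.pvalue == explicit.pvalue)
```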


# Testing for Differences in Proportions <a class = anchor id = anchor2></a>

1. State the null hypothesis $H_0$ and the alternative hypothesis $H_a$ 

One sample test $$H_0: p = p_0 \\ H_a: p \neq p_0$$
Two sample test $$H_0: p_1 = p_2 \\ H_a: p_1 \neq p_2$$

2. Set the level of significance at $\alpha$ and find the critical value of $z$ associated with it, $z_\alpha$. 
3. Calculate the test statistic 

One sample test
$$z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$
Two sample test (two-proportion z-test)

$$z=\frac{(\hat{p}_1-\hat{p}_2)-0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$$

where $\hat{p}$ is the overall (pooled) sample proportion: the total number of “positive” results in the two samples divided by the total number of observations in the two samples.

4. Calculate the $p$-value  
5. Make a decision. Check whether to reject the null hypothesis by comparing the $p$-value to $\alpha$ and/or $|z|$ to $z_\alpha$. If $p$-value $<\alpha$, reject $H_0$
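
The two-sample steps can be sketched by hand as below. The counts `[92, 18]` and group sizes `[873, 126]` are the hard-coded numbers that appear in a later cell of this lab; the two-sided test and the 5% significance level are assumptions made for the illustration.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

# "Positive" results per group and group sizes (numbers from a later cell)
count = np.array([92, 18])
nobs = np.array([873, 126])
alpha = 0.05   # assumed significance level

# Step 3: z statistic with the pooled sample proportion
p1, p2 = count / nobs
p_pool = count.sum() / nobs.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / nobs[0] + 1 / nobs[1]))
z_stat = (p1 - p2 - 0) / se

# Step 2: two-sided critical value; Step 4: two-sided p-value
z_crit = norm.ppf(1 - alpha / 2)
p_value = 2 * norm.sf(abs(z_stat))

# Step 5: decision
reject = p_value < alpha
print(f"z = {z_stat:.4f}, critical value = {z_crit:.4f}, "
      f"p-value = {p_value:.4f}, reject H0: {reject}")

# Cross-check with statsmodels
z_sm, p_sm = proportions_ztest(count, nobs, value=0)
```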

You can perform this test in Python step by step, calculating $z$ and comparing it to the critical value, or you can use a function that does it for you. The `statsmodels.stats.proportion` module has the function `proportions_ztest`, which performs this test. 

`statsmodels.stats.proportion.proportions_ztest(count, nobs, value=None, alternative='two-sided', prop_var=False)`

* `count`: the number of successes in `nobs` trials.
* `nobs`: the number of trials or observations, with the same length as `count`.
* `value`: the value of the null hypothesis. In a one-sample test, it is the hypothesized proportion. In a two-sample test, the null hypothesis is that `prop[0] - prop[1] = value`, where `prop` is the proportion in the two samples. If not provided, `value = 0` and the null is `prop[0] = prop[1]`.

The documentation for the python function to perform this test can be found [here](https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html)
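
One detail worth checking is the `prop_var` argument. As far as I can tell from the statsmodels implementation, the default (`prop_var=False`) builds the variance from the sample proportion $\hat{p}$, whereas the one-sample formula above uses the null value $p_0$ in the denominator. The sketch below uses made-up numbers (30 successes in 100 trials, $H_0: p = 0.5$) to compare both versions against hand calculations.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Made-up numbers for illustration: 30 successes in 100 trials, H_0: p = 0.5
count, nobs, p0 = 30, 100, 0.5
p_hat = count / nobs

# Default (prop_var=False): variance built from the sample proportion
z_default, p_default = proportions_ztest(count, nobs, value=p0)
z_manual_hat = (p_hat - p0) / np.sqrt(p_hat * (1 - p_hat) / nobs)

# prop_var=p0: variance built from the null value,
# as in the one-sample formula given earlier
z_null, p_null = proportions_ztest(count, nobs, value=p0, prop_var=p0)
z_manual_null = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / nobs)

print(f"default: z = {z_default:.4f}; with prop_var=p0: z = {z_null:.4f}")
```

When $\hat{p}$ is close to $p_0$ the two versions give nearly the same statistic, so the choice matters mostly when the sample proportion is far from the null value.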


In [None]:
print('We have a sample of', 
      BirthdataNC['gender'].shape[0],'mothers')
print("The gender of the babies in the sample is distributed:\n",
      BirthdataNC['gender'].value_counts())
print("The proportion of babies of each gender is:\n",
      BirthdataNC['gender'].value_counts(normalize=True))

Let's test whether the proportion of female babies is 50%. 

In [None]:

# Count the female babies explicitly instead of relying on the order of value_counts()
n_female = (BirthdataNC['gender'] == 'female').sum()
(zstat,pvalue ) =proportions_ztest(n_female, 
                                   BirthdataNC['gender'].shape[0], 
                                   0.5)
#print the Z-Statistic and the P-Value
print("\n Z-test for differences in gender proportion:\n Z-Statistic: {0}\n P-Value: {1}\n".
        format(round(zstat,4),
               round(pvalue,4)))

#(0.5 -0.497)/np.sqrt(((0.497*0.503)/1000))

**Now let's test for a difference in the proportion of babies with low birth weight between mothers who were smokers and nonsmokers**

In [None]:

table = pd.DataFrame(pd.crosstab(index = __________ , 
            columns =_______, margins=_______))
print("Number of low birthweight babies by mother's habit \n",table)

print("Proportion of low birthweight babies by mother's habit \n", 
      pd.crosstab(index = BirthdataNC.________, 
            columns =BirthdataNC._____________ , 
            normalize='index', 
            margins=True).round(4))


In [None]:
# Directly copying the numbers from the table above 
(zstat,pvalue ) =proportions_ztest([92,18], [873, 126]
                                   ,0)

# extract numbers from table
(______, _________) =proportions_ztest(table.iloc[:2,0], 
                                       table.iloc[:2,2]
                                       ,0)

#print the P-Value and the z-Statistic

print("\n Z-test for differences in low birth weight by mother's habit\n Z-Statistic: {0}\n P-Value: {1}\n".
        format(round(zstat,4),
               round(pvalue,4))) 

**Interpretation:** 
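In this sample, we fail to reject the null hypothesis that the proportion of babies with low birth weight is the same for smoking and nonsmoking mothers. In other words, we cannot say that the difference in the proportion of low birth weight is statistically significant in this sample. Note that the test compares proportions only; on its own it says nothing about whether smoking causes low birth weight. 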

&nbsp;
<hr />
<p style="font-family:palatino; text-align: center;font-size: 15px">ECON220 Python Programming Laboratory</p>
<p style="font-family:palatino; text-align: center;font-size: 15px">Professor <em> Paloma Lopez de mesa Moyano</em></p>
<p style="font-family:palatino; text-align: center;font-size: 15px"><span style="color: #6666FF;"><em>paloma.moyano@emory.edu</em></span></p>

<p style="font-family:palatino; text-align: center;font-size: 15px">Department of Economics</p>
<p style="font-family:palatino; text-align: center; color: #012169;font-size: 15px">Emory University</p>

&nbsp;

In [None]:
# !jupyter nbconvert --to html nameoffile.ipynb