# Hypothesis Testing

In this notebook we demonstrate formal hypothesis testing using the NHANES data.

It is important to note that the NHANES data are a "complex survey".  The data are not an independent and representative sample from the target population.  Proper analysis of complex survey data should make use of additional information about how the data were collected.  Since complex survey analysis is a somewhat specialized topic, we ignore this aspect of the data here, and analyze the NHANES data as if it were an independent and identically distributed sample from a population.

In [1]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg') # workaround, there may be a better way
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats.distributions as dist

Below we read the data, and convert some of the integer codes to text values.

In [2]:
url = "nhanes_2015_2016.csv"
da = pd.read_csv(url)

da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})

In [3]:
da["SMQ020x"].value_counts()

No     3406
Yes    2319
Name: SMQ020x, dtype: int64

In [4]:
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

da["RIAGENDRx"].value_counts()

Female    2976
Male      2759
Name: RIAGENDRx, dtype: int64

### Hypothesis Tests for One Proportion

The most basic hypothesis test may be the one-sample test for a proportion.  This test is used if we have specified a particular value as the null value for the proportion, and we wish to assess if the data are compatible with the true parameter value being equal to this specified value.  One-sample tests are not used very often in practice, because it is not very common that we have a specific fixed value to use for comparison. For illustration, imagine that the rate of lifetime smoking in another country was known to be 40%, and we wished to assess whether the rate of lifetime smoking in the US were different from 40%.  In the following notebook cell, we carry out the (two-sided) one-sample test that the population proportion of smokers is 0.4, and obtain a p-value of 0.43.  This indicates that the NHANES data are compatible with the proportion of (ever) smokers in the US being 40%. 

##### One-sample tests for proportions are not used very often in practice for a few reasons:

1. Lack of a specific fixed value: One-sample tests require a specific fixed value to compare the observed proportion against. However, in many real-world scenarios, we often don't have a predetermined value to use as the null hypothesis. Proportions are usually estimated from the data itself, making it challenging to define a specific fixed value for comparison.

2. Context-dependent proportions: Proportions can vary depending on the context, population, or time period under consideration. It is rare to have a universally agreed-upon proportion that can serve as a meaningful null hypothesis for comparison across different scenarios.

3. Two-sample tests or regression models: In many cases, researchers are more interested in comparing proportions between two groups (e.g., treatment vs. control) or exploring the relationship between proportions and other variables using regression models. These approaches offer more flexibility and provide additional insights into the data compared to one-sample tests.

4. Limitations of hypothesis testing: Hypothesis testing has its limitations and may not always be the most appropriate tool for making statistical inferences. Alternative approaches, such as confidence intervals or effect size estimation, are often favored as they provide more nuanced and informative results.

While one-sample tests for proportions have limited applications, they can still be useful in specific scenarios where a specific fixed value is available for comparison. However, it's important to consider the context, limitations, and alternative approaches to draw meaningful conclusions from the data.
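Point 4 mentions confidence intervals as an alternative to a test. The following is a minimal sketch of a normal-approximation (Wald) 95% interval for the smoking proportion. It assumes the counts from the `value_counts()` output earlier in the notebook (2319 "Yes" out of 5725) and uses only the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Counts taken from the value_counts() output: 2319 "Yes" out of 5725 responses.
n_yes, n_total = 2319, 5725
p_hat = n_yes / n_total

# Wald interval: the standard error here uses the sample proportion.
se_hat = sqrt(p_hat * (1 - p_hat) / n_total)
z_crit = NormalDist().inv_cdf(0.975)  # roughly 1.96 for a 95% interval

ci_low, ci_high = p_hat - z_crit * se_hat, p_hat + z_crit * se_hat
print(round(ci_low, 4), round(ci_high, 4))
```

If the interval contains the null value 0.4, that is consistent with a two-sided test at the 5% level not rejecting the null hypothesis.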

In [5]:
x = da.SMQ020x.dropna() == "Yes"

In [6]:
p = x.mean()

In [7]:
p, len(x)

(0.4050655021834061, 5725)

In [8]:
se = np.sqrt(.4 * (1 - .4)/ len(x))
se

0.00647467353462031

In [9]:
test_stat = (p - 0.4) / se
test_stat

0.7823563854332805

---
##### `dist.norm.cdf(test_stat)`
`dist.norm` represents the normal distribution object from the scipy.stats module, specifically the standard normal distribution (mean=0, standard deviation=1).

The `cdf` function is then called on the normal distribution object to calculate the cumulative probability. It takes a value as input and returns the probability that a random variable from the standard normal distribution is less than or equal to that value.

For example, if you have a test statistic and you want to calculate the probability of observing a value less than or equal to that test statistic under the assumption of a standard normal distribution, you can use `dist.norm.cdf(test_stat)`.

In the context of hypothesis testing, the CDF is often used to calculate the p-value. The p-value is the probability of observing a test statistic as extreme or more extreme than the one obtained, assuming the null hypothesis is true. By using the CDF, you can calculate this probability and obtain the corresponding p-value for hypothesis testing.
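To make the link between the CDF and a two-sided p-value concrete, here is a small sketch that uses the standard library's `statistics.NormalDist` as a stand-in for `dist.norm` (a substitution made only so the example has no third-party dependencies). The test statistic is the value computed earlier in this notebook.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # stand-in for dist.norm
z = 0.7823563854332805  # test statistic from the one-sample proportion test

lower_tail = std_normal.cdf(-abs(z))        # P(Z <= -|z|)
upper_tail = 1.0 - std_normal.cdf(abs(z))   # P(Z >= |z|)

# The normal distribution is symmetric, so both tails have the same area
# and the two-sided p-value is twice the area of one tail.
p_two_sided = 2.0 * lower_tail
print(lower_tail, upper_tail, p_two_sided)
```

Doubling the lower tail at `-abs(z)` gives the same result whether the observed statistic is positive or negative.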

In [10]:
pvalue = 2 * dist.norm.cdf(-np.abs(test_stat)) 
# dist.norm refers to the normal distribution object from the scipy.stats module, 
# which represents a standard normal distribution.
print(test_stat, pvalue)

0.7823563854332805 0.4340051581348052


The following cells carry out the same test using the Statsmodels library.  The `proportions_ztest` result is slightly different from the result obtained above because Statsmodels by default uses the sample proportion instead of the null proportion when computing the standard error.  This distinction is rarely consequential, but passing `prop_var=0.4` tells Statsmodels to use the null proportion for the standard error, in which case the result agrees exactly with the hand calculation.  `proportions_ztest` uses the normal approximation to the sampling distribution of the test statistic, while `binom_test` uses the exact binomial sampling distribution.  The p-values are nearly identical in all cases. This is expected when the sample size is large and the proportion is not close to either 0 or 1.

In [11]:
sm.stats.proportions_ztest(x.sum(), len(x), 0.4)

(0.7807518954896244, 0.43494843171868214)

In [12]:
sm.stats.binom_test(x.sum(), len(x), 0.4)

0.4340360854410036

### Hypothesis Tests for Two Proportions

Comparative tests tend to be used much more frequently than tests comparing one population to a fixed value.  A two-sample test of proportions is used to assess whether the proportion of individuals with some trait differs between two sub-populations.  For example, we can compare the smoking rates between females and males. Smoking rates vary strongly with age, so it is often preferable to make this comparison within a narrow age band; the cells below do not apply such a restriction and use all respondents with non-missing data.  We carry out this test without using any libraries, implementing all the test procedures covered elsewhere in the course using Python code.  We find that the smoking rate for men is around 21 percentage points greater than the smoking rate for women (51% versus 30%), and this difference is statistically significant (the p-value is about 6e-58).

In [13]:
dx = da[["SMQ020x", "RIDAGEYR", "RIAGENDRx"]].dropna()

dx.head()

  SMQ020x  RIDAGEYR RIAGENDRx
0     Yes        62      Male
1     Yes        53      Male
2     Yes        78      Male
3      No        56    Female
4      No        42    Female


In [14]:
p = dx.groupby("RIAGENDRx")["SMQ020x"].agg([lambda z: np.mean(z == "Yes"), "size"])
p.columns = ["Smoke", "N"]
p

              Smoke     N
RIAGENDRx
Female     0.304845  2972
Male       0.513258  2753


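We first carry out the test by hand, pooling the two groups to estimate the common proportion under the null hypothesis.  Essentially the same test can also be conducted by converting the "Yes"/"No" responses to numbers (Yes=1, No=0) and conducting a two-sample t-test, which is done further below.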

In [15]:
# Pooled estimate of the common proportion under the null hypothesis
p_combined = (dx.SMQ020x == "Yes").mean()

# Variance of a single 0/1 response under the pooled proportion
var = p_combined * (1 - p_combined)

# Standard error of the difference between the two sample proportions
se = np.sqrt(var * (1 / p.N.Female + 1 / p.N.Male))

In [16]:
print("Pooled proportion:", p_combined.round(4))
print("Variance of a single response:", var.round(4))
print("Standard error of the difference in proportions:", se.round(4))

Pooled proportion: 0.4051
Variance of a single response: 0.241
Standard error of the difference in proportions: 0.013


In [17]:
test_stat = (p.Smoke.Female - p.Smoke.Male) / se
p_value = 2 * dist.norm.cdf(-np.abs(test_stat))
(test_stat, p_value)

(-16.049719603652488, 5.742288777302776e-58)
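The hand calculation can be wrapped in a small reusable function. This is a sketch using only the standard library; the function name `two_proportion_ztest` is hypothetical, and the counts (906 of 2972 women, 1413 of 2753 men) correspond to the group proportions and sizes in the table above.

```python
from math import erfc, sqrt

def two_proportion_ztest(count1, n1, count2, n2):
    """Two-sided pooled z-test of the null hypothesis that two proportions are equal."""
    p1, p2 = count1 / n1, count2 / n2
    # Pooled proportion: the common value estimated under the null hypothesis.
    p_pooled = (count1 + count2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal p-value, 2 * P(Z >= |z|), written with erfc so that
    # very small tail probabilities do not round to zero.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

z, p_value = two_proportion_ztest(906, 2972, 1413, 2753)
print(z, p_value)
```

Using `erfc` rather than `1 - cdf` matters here because the statistic is so far in the tail that the naive subtraction would lose all precision.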

In [18]:
dx_females = dx.loc[dx.RIAGENDRx == "Female", "SMQ020x"].replace({"Yes": 1, "No": 0})
dx_females.value_counts()

0    2066
1     906
Name: SMQ020x, dtype: int64

In [19]:
dx_males = dx.loc[dx.RIAGENDRx == "Male", "SMQ020x"].replace({"Yes": 1, "No": 0})
dx_males.value_counts()

1    1413
0    1340
Name: SMQ020x, dtype: int64

---
##### `proportions_ztest` - comparing the difference in proportions
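The `sm.stats.ttest_ind` function from the `statsmodels` library performs a t-test for the difference in means between two independent samples. Applied to 0/1-coded responses it gives a result close to the z-test for proportions, but it estimates the variance from the within-group sample variances rather than from the pooled proportion under the null hypothesis, so the two statistics are not identical.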

To compare the difference in proportions, you can use the `proportions_ztest` function from the same library. Here's the syntax for `proportions_ztest`:

```python
import statsmodels.stats.proportion as smprop

count1 = 100  # number of successes in sample 1
nobs1 = 200   # total number of observations in sample 1

count2 = 150  # number of successes in sample 2
nobs2 = 250   # total number of observations in sample 2

z_stat, p_value = smprop.proportions_ztest([count1, count2], [nobs1, nobs2])
print("Z-statistic:", z_stat)
print("P-value:", p_value)
```

In this example, `count1` and `count2` represent the number of successes (e.g., the number of individuals with a specific characteristic) in samples 1 and 2, respectively. `nobs1` and `nobs2` represent the total number of observations (e.g., the sample sizes) for samples 1 and 2.

The `proportions_ztest` function performs a z-test for the difference in proportions and returns the z-statistic and p-value. The results can be printed using `print`.

##### `sm.stats.ttest_ind` -  t-test for the difference in means between two independent samples
The `sm.stats.ttest_ind` is a function from the `statsmodels` library used to perform an independent two-sample t-test. This test compares the means of two independent samples to determine if there is a statistically significant difference between them.

The syntax for `sm.stats.ttest_ind` is as follows:

```python
sm.stats.ttest_ind(sample1, sample2, alternative='two-sided', usevar='pooled', value=0)
```

The parameters used in the function are:
- `sample1` and `sample2`: The two samples to compare. These can be NumPy arrays, Pandas Series, or any other array-like objects.
- `alternative` (optional): Specifies the alternative hypothesis for the test. It can take the values `'two-sided'` (default), `'larger'`, or `'smaller'`.
- `usevar` (optional): Specifies whether to use the pooled variance or separate variances for the two samples. It can take the values `'pooled'` (default) or `'unequal'`.
- `value` (optional): Specifies the null hypothesis value to test against. By default, it is set to 0.
---

In [20]:
sm.stats.proportions_ztest([906, 1413], [2972, 2753])  # matches the test statistic and p-value computed by hand above

(-16.049719603652488, 5.742288777302776e-58)

In [21]:
sm.stats.ttest_ind(dx_females, dx_males)  # t-test on the 0/1 data: close to, but not identical to, the z-test for proportions

(-16.420585558984445, 3.0320887866906843e-59, 5723.0)
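The gap between the two statistics (about -16.05 versus -16.42) comes entirely from the variance estimate. The sketch below reproduces both values from the counts alone, using only the standard library; the variable names are illustrative.

```python
from math import sqrt

# Smokers and group sizes from the cells above.
yes_f, n_f = 906, 2972
yes_m, n_m = 1413, 2753
p_f, p_m = yes_f / n_f, yes_m / n_m
diff = p_f - p_m
size_factor = 1 / n_f + 1 / n_m

# z-test: variance from the pooled proportion, which also absorbs
# the difference between the two group proportions.
p_pooled = (yes_f + yes_m) / (n_f + n_m)
var_z = p_pooled * (1 - p_pooled)

# Pooled t-test: within-group sums of squares divided by n_f + n_m - 2.
# For 0/1 data the sum of squared deviations within a group is n * p * (1 - p).
var_t = (n_f * p_f * (1 - p_f) + n_m * p_m * (1 - p_m)) / (n_f + n_m - 2)

z_stat = diff / sqrt(var_z * size_factor)
t_stat = diff / sqrt(var_t * size_factor)
print(z_stat, t_stat)
```

Because the within-group variance is smaller than the pooled-proportion variance whenever the two sample proportions differ, the t-statistic is somewhat larger in magnitude.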

### Hypothesis Tests for One Mean

Tests of means are similar in many ways to tests of proportions.  Just as with proportions, for comparing means there are one and two-sample tests, z-tests and t-tests, and one-sided and two-sided tests.  As with tests of proportions, one-sample tests of means are not very common, but we illustrate a one-sample test in the cells below.  We compare the mean systolic blood pressure of men between the ages of 40 and 50 to the fixed value 120 (which is the lower threshold for "pre-hypertension"), and find that the mean is significantly different from 120 (the point estimate of the mean is 126).

In [22]:
dx = da[["BPXSY1", "RIDAGEYR", "RIAGENDRx"]].dropna()
dx.head()

   BPXSY1  RIDAGEYR RIAGENDRx
0   128.0        62      Male
1   146.0        53      Male
2   138.0        78      Male
3   132.0        56    Female
4   100.0        42    Female


In [23]:
dx = dx.loc[(dx.RIDAGEYR >= 40) & (dx.RIDAGEYR <= 50) & (dx.RIAGENDRx == "Male"), :]
dx.head()

    BPXSY1  RIDAGEYR RIAGENDRx
10   144.0        46      Male
11   116.0        45      Male
20   110.0        49      Male
42   128.0        42      Male
51   118.0        50      Male


In [24]:
print(dx.BPXSY1.mean())

125.86698337292161


In [34]:
# Using z-test

z_stat, p_value = sm.stats.ztest(dx.BPXSY1, value=120)
# Print the results
print("Z-statistic:", z_stat)
print("P-value:", p_value)

Z-statistic: 7.469764137102597
P-value: 8.033869113167905e-14


In [33]:
from scipy import stats

# Using t-test

# Perform the one-sample t-test
t_stat, p_value = stats.ttest_1samp(dx.BPXSY1, popmean=120)

# Print the results
print("T-statistic:", t_stat)
print("P-value:", p_value)

T-statistic: 7.469764137102597
P-value: 4.696272802378146e-13


When the population standard deviation is unknown and must be estimated from the sample, the t-test is the conventional choice, because the t-distribution accounts for the extra uncertainty in the estimated standard deviation. With a sample this large the two tests are nearly equivalent: the test statistics are identical and only the reference distribution differs, which is why the two p-values differ slightly.

In both cases, the test statistic indicates a significant difference between the mean systolic blood pressure and the hypothesized value of 120 for men between the ages of 40 and 50.

Both p-values are below 1e-12, far below the significance level of 0.05. Therefore, we reject the null hypothesis and conclude that the mean systolic blood pressure in this subpopulation is not equal to 120.
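The relationship between the one-sample z-test and t-test can be written out by hand. The sketch below uses a synthetic sample (made-up values, not NHANES data) so that it runs on its own; the seed, sample size, and distribution parameters are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a blood pressure sample (made-up values, not NHANES data).
rng = np.random.default_rng(1)
sample = rng.normal(loc=126.0, scale=16.0, size=400)
null_mean = 120.0

# Both tests use the same statistic: (sample mean - null value) / standard error.
std_err = sample.std(ddof=1) / np.sqrt(len(sample))
stat = (sample.mean() - null_mean) / std_err

# Only the reference distribution differs.
p_z = 2 * stats.norm.sf(abs(stat))                   # standard normal
p_t = 2 * stats.t.sf(abs(stat), df=len(sample) - 1)  # Student t with n - 1 degrees of freedom

print(stat, p_z, p_t)
```

The t-distribution has heavier tails than the normal distribution, so for the same statistic the t-based p-value is never smaller than the z-based one.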

### Hypothesis Tests Comparing Means

In the cells below, we carry out a formal test of the null hypothesis that the mean blood pressure for women between the ages of 50 and 60 is equal to the mean blood pressure of men between the ages of 50 and 60.  The results indicate that while the mean systolic blood pressure for men is slightly greater than that for women (129 mm Hg versus 128 mm Hg), this difference is not statistically significant.

There are a number of different variants on the two-sample t-test. Two often-encountered variants are the t-test carried out using the t-distribution, and the t-test carried out using the normal approximation to the reference distribution of the test statistic, often called a z-test.  Below we display results from both these testing approaches.  When the sample size is large, the difference between the t-test and z-test is very small.  

In [35]:
dx = da[["BPXSY1", "RIDAGEYR", "RIAGENDRx"]].dropna()
dx = dx.loc[(dx.RIDAGEYR >= 50) & (dx.RIDAGEYR <= 60), :]
dx.head()

    BPXSY1  RIDAGEYR RIAGENDRx
1    146.0        53      Male
3    132.0        56    Female
9    178.0        56      Male
15   134.0        57    Female
19   136.0        54    Female


In [36]:
bpx_female = dx.loc[dx.RIAGENDRx=="Female", "BPXSY1"]
bpx_male = dx.loc[dx.RIAGENDRx=="Male", "BPXSY1"]
print(bpx_female.mean(), bpx_male.mean())

127.92561983471074 129.23829787234044


In [41]:
# Perform the two-sample z-test (statsmodels)
print(sm.stats.ztest(bpx_female, bpx_male))

(-1.105435895556249, 0.2689707570859362)


In [42]:
# Perform the two-sample t-test 
print(sm.stats.ttest_ind(bpx_female, bpx_male)) 

(-1.105435895556249, 0.26925004137768577, 952.0)


In [43]:
# Perform the two-sample t-test (scipy.stats)
stats.ttest_ind(bpx_female, bpx_male)


Ttest_indResult(statistic=-1.105435895556249, pvalue=0.26925004137768577)
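The z-test and t-test p-values above can be recovered from the shared test statistic alone, which shows that the only difference is the reference distribution. This sketch takes the statistic and degrees of freedom from the outputs above as given.

```python
from scipy import stats

t_stat = -1.105435895556249  # statistic reported by all three cells above
df = 952                     # degrees of freedom reported by sm.stats.ttest_ind

p_normal = 2 * stats.norm.sf(abs(t_stat))  # z-test: standard normal reference
p_t = 2 * stats.t.sf(abs(t_stat), df)      # t-test: Student t reference
print(p_normal, p_t)

# With few degrees of freedom the same statistic gives a noticeably larger p-value.
p_t_small = 2 * stats.t.sf(abs(t_stat), 10)
print(p_t_small)
```

With 952 degrees of freedom the t-distribution is almost indistinguishable from the normal distribution, so the two p-values agree to about three decimal places.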

Another important aspect of two-sample mean testing is "heteroscedasticity", meaning that the variances within the two groups being compared may be different. While the goal of the test is to compare the means, the variances play an important role in calibrating the statistics (deciding how big the mean difference needs to be to be declared statistically significant). In the NHANES data, we see that there are moderate differences between the amount of variation in BMI for females and for males, looking within 10-year age bands. In every age band, females have greater variation than males.

In [46]:
# Assign each respondent to an age band, then compute the BMI standard deviation by band and gender
da["agegrp"] = pd.cut(da.RIDAGEYR, [18, 30, 40, 50, 60, 70, 80])
da.groupby(["agegrp", "RIAGENDRx"])["BMXBMI"].std().unstack()

RIAGENDRx    Female      Male
agegrp
(18, 30]   7.745893  6.649440
(30, 40]   8.315608  6.622412
(40, 50]   8.076195  6.407076
(50, 60]   7.575848  5.914373
(60, 70]   7.604514  5.933307
(70, 80]   6.284968  4.974855


The standard error of the mean difference (e.g. mean female blood pressure minus mean male blood pressure) can be estimated in at least two different ways. In the statsmodels library, these approaches are referred to as the "pooled" and the "unequal" approach to estimating the variance. If the variances are equal (i.e. there is no heteroscedasticity), then there should be little difference between the two approaches. Even in the presence of moderate heteroscedasticity, as we have here, we can see that the results for the two approaches are quite similar. Below we have a loop that considers each 10-year age band and assesses the evidence for a difference in mean BMI for women and for men. The results printed in each row of output are the test statistic and p-value.

In [48]:
for k, v in da.groupby("agegrp"):
    # Extract BMI values for females in the current age group
    bmi_female = v.loc[v.RIAGENDRx=="Female", "BMXBMI"].dropna()
    bmi_female = sm.stats.DescrStatsW(bmi_female)  # Create DescrStatsW object for female BMI values
    
    # Extract BMI values for males in the current age group
    bmi_male = v.loc[v.RIAGENDRx=="Male", "BMXBMI"].dropna()
    bmi_male = sm.stats.DescrStatsW(bmi_male)  # Create DescrStatsW object for male BMI values
    
    print(k)  # Print the current age group
    print("pooled: ", sm.stats.CompareMeans(bmi_female, bmi_male).ztest_ind(usevar='pooled'))
    # Perform z-test assuming pooled variance and print the test result
    
    print("unequal: ", sm.stats.CompareMeans(bmi_female, bmi_male).ztest_ind(usevar='unequal'))
    # Perform z-test assuming unequal variance and print the test result
    
    print()

(18, 30]
pooled:  (1.7026932933643306, 0.08862548061449803)
unequal:  (1.7174610823927183, 0.08589495934713169)

(30, 40]
pooled:  (1.4378280405644919, 0.15048285114648174)
unequal:  (1.4437869620833497, 0.1487989105789246)

(40, 50]
pooled:  (2.8933761158070186, 0.003811246059501354)
unequal:  (2.9678691663536725, 0.0029987194174035366)

(50, 60]
pooled:  (3.362108779981383, 0.0007734964571391287)
unequal:  (3.3754943901739387, 0.0007368319423226156)

(60, 70]
pooled:  (3.617240144243268, 0.00029776102103194453)
unequal:  (3.628483094544553, 0.00028509141471492935)

(70, 80]
pooled:  (2.926729252512241, 0.003425469414486057)
unequal:  (2.9377798867692064, 0.0033057163315194853)



The data (`da`) is grouped by the variable "agegrp" using the `groupby()` function. For each group, the body mass index (BMI) of females and males is extracted, and missing values are dropped.


- The `for` loop iterates over each group (`k` represents the group label, and `v` represents the corresponding data for that group) obtained from grouping the data by "agegrp".

- For each group, the BMI values for females (`bmi_female`) and males (`bmi_male`) are extracted based on the condition that "RIAGENDRx" is equal to "Female" or "Male", respectively.

- Missing values (`NaN`) are then dropped using the `dropna()` function, resulting in two separate arrays of BMI values for females and males.

- The BMI arrays are converted into `DescrStatsW` objects using `sm.stats.DescrStatsW()`. This class provides various statistical methods for descriptive analysis, such as computing mean, standard deviation, confidence intervals, and conducting hypothesis tests.

- The `CompareMeans` class from `sm.stats` is used to compare the means of the BMI values between females and males. Two types of tests are conducted: one assuming pooled variance (`usevar='pooled'`) and the other assuming unequal variance (`usevar='unequal'`).

- The `ztest_ind()` function is called on the `CompareMeans` object to perform the independent two-sample z-test for the means. The results include the calculated test statistic and the associated p-value.

- The group label (`k`) is printed, followed by the results of the z-tests using both pooled and unequal variances.

In short, the code performs an independent two-sample z-test comparing mean BMI between females and males within each age group, once with the pooled variance and once with unequal variances.

In the output above, the test statistic is positive in every age band, meaning that the mean BMI for females exceeds that for males. The difference is statistically significant at the 0.05 level in the four age bands above 40, but not in the two youngest bands, and the pooled and unequal approaches lead to the same conclusion in every band.