# Mean and Proportion Hypothesis Tests

In [1]:
import numpy as np
import pandas as pd
import scipy.stats

file_path = '~/csce5310/houston-aqi-2010-2021.csv'
df = pd.read_csv(file_path)

print(df.head())

   Unnamed: 0  day_of_year  year   latitude  longitude  avg_pm10  aqi_pm10  \
0           0            2  2010  29.733726 -95.257593      13.0        12   
1           1            2  2010  29.733726 -95.257593      13.0        12   
2           2            2  2010  29.733726 -95.257593      13.0        12   
3           3            2  2010  29.733726 -95.257593      13.0        12   
4           4            2  2010  29.733726 -95.257593      13.0        12   

     avg_co  aqi_co    avg_no2  ...    avg_o3  aqi_o3  avg_pm25  aqi_pm25  \
0  0.297667     NaN  17.258333  ...  0.027294      32      11.6      48.0   
1  0.297667     NaN  17.258333  ...  0.027294      32      11.6      48.0   
2  0.297667     NaN  17.258333  ...  0.027294      32       9.7      40.0   
3  0.297667     NaN  17.258333  ...  0.027294      32       9.7      40.0   
4  0.325000     6.0  17.258333  ...  0.027294      32      11.6      48.0   

    avg_so2  aqi_so2  avg_humidity  avg_temperature  avg_wind  avg_p

In [2]:
print(f"Mean Pressure: {df['avg_pressure'].mean()}")
print(f"Mean Wind: {df['avg_wind'].mean()}")
print(f"Mean Temperature: {df['avg_temperature'].mean()}")
print(f"Mean Humidity:  {df['avg_humidity'].mean()}")

Mean Pressure: 1017.8874957186971
Mean Wind: 5.238881443739425
Mean Temperature: 68.73953821573603
Mean Humidity:  65.6445454107445


## Pressure

We start by drawing a random sample of 100 pressure readings from the dataset, without replacement.

In [3]:
n = 100
random_indices = np.random.choice(df.index, n, replace=False)
pressure_samples = df.loc[random_indices, 'avg_pressure']
pressure_samples

2790    1011.541667
2046    1007.478261
1649    1015.333333
2060    1011.541667
3144    1017.583333
           ...     
418     1010.750000
4630    1008.750000
2148    1016.583333
4459    1018.041667
3722    1026.291667
Name: avg_pressure, Length: 100, dtype: float64
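
Because `np.random.choice` draws from an unseeded global generator, the sample above changes on every run, so the figures reported in the following cells will not reproduce exactly. A minimal sketch of a reproducible draw is shown here. It assumes NumPy's `default_rng` generator; the helper name `draw_sample`, the seed value 42, and the stand-in DataFrame are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Stand-in data for illustration only; the notebook itself reads the Houston AQI file.
demo_df = pd.DataFrame({"avg_pressure": np.linspace(1000.0, 1030.0, 500)})

def draw_sample(frame, column, n, seed):
    """Draw n values of `column` without replacement using a seeded generator."""
    rng = np.random.default_rng(seed)  # the seed is an arbitrary, hypothetical choice
    idx = rng.choice(frame.index, size=n, replace=False)
    return frame.loc[idx, column]

first = draw_sample(demo_df, "avg_pressure", n=100, seed=42)
second = draw_sample(demo_df, "avg_pressure", n=100, seed=42)
print(first.equals(second))  # the same seed yields the same sample
```

Fixing the seed trades randomness across runs for repeatability, which is usually what a written report needs.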

### Proportion Test

We run a one-sample proportion test with a hypothesized proportion of p = 0.5. First we check that the requirements for the normal approximation are satisfied:

#### Requirements Check

In [17]:
p = 0.5
q = 1 - p
print(f'np >= 5 ? {len(pressure_samples) * p >= 5}')
print(f'nq >= 5 ? {len(pressure_samples) * q >= 5}')

np >= 5 ? True
nq >= 5 ? True


#### Hypothesis Test

We test the claim that more than 50% of pressure readings exceed 1017 units, using a right-tailed test at a significance level of 0.01 (99% confidence). The null hypothesis is that the proportion of readings above 1017 units equals 50%; the alternative is that it is greater than 50%. We compute the z-statistic (labeled "z-index" in the output) with numpy and obtain the p-value from the survival function of scipy.stats.norm.

In [9]:
alpha = 0.01
p_over_1017 = np.where(pressure_samples > 1017, 1, 0).sum() / n
z = (p_over_1017 - p) / np.sqrt(p * q / n)
print(f'The z-index is: {z}')
print(f'The p-value is: {scipy.stats.norm.sf(z)}')
print(f'The critical value is: {scipy.stats.norm.ppf(1 - alpha)}')
print(f'Reject null hypothesis? {scipy.stats.norm.sf(z) <= alpha}')

The z-index is: -0.7999999999999996
The p-value is: 0.7881446014166031
The critical value is: 2.3263478740408408
Reject null hypothesis? False


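Since the p-value (about 0.79) is greater than alpha = 0.01 for this right-tailed test, we fail to reject the null hypothesis. The sample does not provide sufficient evidence that more than 50% of pressure readings exceed 1017 units. Note that this does not prove the proportion equals 50%; it only means the data are compatible with that value.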
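
The calculation in the cell above can be wrapped in a small reusable function. This is a sketch under stated assumptions: the function name `right_tailed_proportion_ztest` is hypothetical, and the count of 46 successes out of 100 is taken from the sample proportion of 0.46 reported in this notebook.

```python
import numpy as np
from scipy import stats

def right_tailed_proportion_ztest(successes, n, p0, alpha):
    """One-sample z-test of H0: p = p0 against H1: p > p0."""
    p_hat = successes / n
    # The standard error uses p0 because the null hypothesis is assumed true.
    standard_error = np.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = stats.norm.sf(z)  # upper-tail area for a right-tailed test
    return z, p_value, bool(p_value <= alpha)

z, p_value, reject = right_tailed_proportion_ztest(successes=46, n=100, p0=0.5, alpha=0.01)
print(z, p_value, reject)
```

With these inputs the function gives the same statistic and decision as the cell above. A negative statistic in a right-tailed test always produces a p-value above 0.5, so the null hypothesis cannot be rejected at any conventional significance level.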

#### Confidence Interval

We compute a 99% confidence interval (matching the significance level of 0.01) for the population proportion of pressure readings above 1017 units. The standard error here uses the sample proportion rather than the hypothesized value of 0.5.

In [6]:
margin_of_error = scipy.stats.norm.ppf(1 - alpha / 2) * np.sqrt(p_over_1017 * (1 - p_over_1017) / n)
print(f'The proportion of pressure samples greater than 1017 unit is: {p_over_1017}')
print(f'The confidence interval is: ({round(p_over_1017 - margin_of_error, 2)}, {round(p_over_1017 + margin_of_error, 2)})')

The proportion of pressure samples greater than 1017 unit is: 0.46
The confidence interval is: (0.33, 0.59)
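
The interval above is the normal-approximation (Wald) interval. A minimal sketch of the same calculation as a function follows. The name `wald_interval` is an illustrative assumption, and the inputs 0.46 and 100 are the sample proportion and sample size reported in this notebook.

```python
import numpy as np
from scipy import stats

def wald_interval(p_hat, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    margin = z_star * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = wald_interval(p_hat=0.46, n=100, confidence=0.99)
print(round(low, 2), round(high, 2))
```

The interval contains 0.5, which agrees with the decision not to reject the null hypothesis in the proportion test.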


### Mean Test

We now test a hypothesis about the mean pressure: that the population mean equals 1017 units. We use a one-sample t-test because we assume the population standard deviation is not known, as is commonly the case when testing a claim about a population mean.

#### Requirements Check

In [7]:
print(f'n > 30 ? {n > 30}')

n > 30 ? True


#### Hypothesis Test

We conduct a two-tailed test of the null hypothesis that the population mean pressure equals 1017 units, at a significance level of 0.05 (95% confidence).

In [13]:
alpha = 0.05
degrees_of_freedom = len(pressure_samples) - 1
t_statistic, p_value = scipy.stats.ttest_1samp(pressure_samples, 1017)
critical_value = scipy.stats.t.ppf(1 - alpha / 2, degrees_of_freedom)
print(f'The t-statistic is: {t_statistic}')
print(f'The p-value is: {p_value}')
print(f'The critical value is: {critical_value}')
print(f'Reject null hypothesis? {p_value <= alpha}')

The t-statistic is: 1.0708520501921184
The p-value is: 0.28683969815024624
The critical value is: 1.9842169515086827
Reject null hypothesis? False


Since the p-value (about 0.29) is greater than the significance level of 0.05, we fail to reject the null hypothesis. The sample does not provide sufficient evidence that the population mean pressure differs from 1017 units.
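
To make explicit what `scipy.stats.ttest_1samp` computes, the sketch below derives the t-statistic and two-tailed p-value by hand and compares them with SciPy's result. The helper name `one_sample_t` and the synthetic data are assumptions for illustration; the values are not drawn from the Houston dataset.

```python
import numpy as np
from scipy import stats

def one_sample_t(sample, mu0):
    """Two-tailed one-sample t-test computed from its definition."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # ddof=1 gives the sample standard deviation, as the t-test requires.
    standard_error = sample.std(ddof=1) / np.sqrt(n)
    t = (sample.mean() - mu0) / standard_error
    p_value = 2 * stats.t.sf(abs(t), df=n - 1)  # both tails
    return t, p_value

rng = np.random.default_rng(0)
demo = rng.normal(loc=1017.5, scale=6.0, size=100)  # synthetic values, not the real data

t_manual, p_manual = one_sample_t(demo, 1017)
t_scipy, p_scipy = stats.ttest_1samp(demo, 1017)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```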

#### Confidence Interval

We construct a 95% confidence interval for the population mean, using the t critical value computed above.

In [16]:
margin_of_error = critical_value * pressure_samples.std() / np.sqrt(n)
print(f'Confidence interval: ({pressure_samples.mean() - margin_of_error}, {pressure_samples.mean() + margin_of_error})')

Confidence interval: (1016.4244557388837, 1018.9251094811162)


The confidence interval contains the hypothesized mean of 1017 units, which is consistent with failing to reject the null hypothesis in the two-tailed test above.
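
This agreement is not a coincidence: for a two-tailed t-test at significance level alpha, the hypothesized mean lies inside the (1 - alpha) confidence interval exactly when the test fails to reject. The sketch below checks this on synthetic data. The helper names `mean_interval` and `agrees_with_test` and the generated values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from scipy import stats

def mean_interval(sample, confidence):
    """t-based confidence interval for the population mean."""
    sample = np.asarray(sample, dtype=float)
    return stats.t.interval(confidence, sample.size - 1,
                            loc=sample.mean(), scale=stats.sem(sample))

def agrees_with_test(sample, mu0, alpha):
    """True when 'mu0 inside the interval' matches 'fail to reject H0'."""
    low, high = mean_interval(sample, 1 - alpha)
    inside = low <= mu0 <= high
    _, p_value = stats.ttest_1samp(sample, mu0)
    return bool(inside == (p_value > alpha))

rng = np.random.default_rng(1)
demo = rng.normal(loc=1017.5, scale=6.0, size=100)  # synthetic values, not the real data

for mu0 in (1015, 1017, 1018, 1020):
    print(mu0, agrees_with_test(demo, mu0, alpha=0.05))
```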