# Inferences about Two Means (Independent Samples)

We use normalized data to make inferences about two sample means (independent samples) using the appropriate statistical hypothesis tests. By using normalized data, we can have a more helpful test that shows us how to compare means when they are based on ranges that their respective values take on.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

file_path = '~/csce5310/houston-aqi-2010-2021.csv'
df = pd.read_csv(file_path)

print (df.head())

   Unnamed: 0  day_of_year  year   latitude  longitude  avg_pm10  aqi_pm10  \
0           0            2  2010  29.733726 -95.257593      13.0        12   
1           1            2  2010  29.733726 -95.257593      13.0        12   
2           2            2  2010  29.733726 -95.257593      13.0        12   
3           3            2  2010  29.733726 -95.257593      13.0        12   
4           4            2  2010  29.733726 -95.257593      13.0        12   

     avg_co  aqi_co    avg_no2  ...    avg_o3  aqi_o3  avg_pm25  aqi_pm25  \
0  0.297667     NaN  17.258333  ...  0.027294      32      11.6      48.0   
1  0.297667     NaN  17.258333  ...  0.027294      32      11.6      48.0   
2  0.297667     NaN  17.258333  ...  0.027294      32       9.7      40.0   
3  0.297667     NaN  17.258333  ...  0.027294      32       9.7      40.0   
4  0.325000     6.0  17.258333  ...  0.027294      32      11.6      48.0   

    avg_so2  aqi_so2  avg_humidity  avg_temperature  avg_wind  avg_p

In [2]:
pressure_compare = df[['avg_pressure', 'avg_wind', 'avg_temperature']].values
scaler = MinMaxScaler()
pressure_compare = scaler.fit_transform(pressure_compare)
print(f"Mean Pressure: {round(pressure_compare[:, 0].mean(), 2)}")
print(f"Mean Wind: {round(pressure_compare[:, 1].mean(), 2)}")
print(f"Mean Temperature: {round(pressure_compare[:, 2].mean(), 2)}")

Mean Pressure: 0.43
Mean Wind: 0.29
Mean Temperature: 0.62


In [3]:
n = 100

def get_random_sample(variable):
    return np.random.choice(pressure_compare[:, variable], size=n, replace=False)

## Pressure vs. Wind

### Requirements Check

The requirements are all satisified, as we assume the standard deviations of both samples are unknown, they are random and independently distributed random samples, and n is large (n=100).

### Hypothesis Test

We conduct a hypothesis test that compares the sample means for our pressure variable with the next two variables of wind and temperature. We use a significance level of 0.05 and calculate the degrees of freedom using Welch's statistic (i.e. $(A + B)^2 / ((A^2 / (n_1 - 1)) + (B^2 / (n_2 - 1)))$ where $A = s_1^2 / n_1$ and $B = s_2^2 / n_2$) 

In [4]:
alpha = 0.05
pressure_samples = get_random_sample(0)
wind_samples = get_random_sample(1)

pressure_var = np.var(pressure_samples, ddof=1)
wind_var = np.var(wind_samples, ddof=1)
A, B = pressure_var / n, wind_var / n

degrees_of_freedom = int(((A + B) ** 2) / (((A ** 2) / (n - 1)) + ((B ** 2) / (n - 1))))
critical_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)

t_statistic, p_value = stats.ttest_ind(pressure_samples, wind_samples)


print(f'The sample means are {round(pressure_samples.mean(), 2)} (for pressure) and {round(wind_samples.mean(), 2)} (for wind)')
print(f'The degrees of freedom is {degrees_of_freedom}')
print(f'The t-statistic is: {t_statistic}')
print(f'The p-value is: {p_value}')
print(f'The critical value is: {critical_value}')

The sample means are 0.41 (for pressure) and 0.29 (for wind)
The degrees of freedom is 197
The t-statistic is: 4.62339530051101
The p-value is: 6.802338430131697e-06
The critical value is: 1.9720790337760217


Since the p-value is less than our significance level of 0.05, we reject the null hypothesis in favor of the alternative hypothesis, which states that the sample means are statistically different to a significant extent.

### Confidence Interval

In [5]:
margin_of_error = critical_value * np.sqrt(A + B)
mean_difference = pressure_samples.mean() - wind_samples.mean()
print(f'The difference between sample means is: {round(mean_difference, 2)}')
print(f'Confidence Interval: ({round(mean_difference - margin_of_error, 2)}, {round(mean_difference + margin_of_error, 2)})')

The difference between sample means is: 0.11
Confidence Interval: (0.06, 0.16)


Based on the above confidence interval, we can assume with 95% confidence that the actual difference between population means is between 14% and 22% of their normalized values.

## Pressure vs. Temperature

### Hypothesis Test

In [6]:
alpha = 0.05
temperature_samples = get_random_sample(2)

temperature_var = np.var(temperature_samples, ddof=1)
B = temperature_var / n

degrees_of_freedom = int(((A + B) ** 2) / (((A ** 2) / (n - 1)) + ((B ** 2) / (n - 1))))
critical_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)

t_statistic, p_value = stats.ttest_ind(pressure_samples, wind_samples)

print(f'The sample means are {round(pressure_samples.mean(), 2)} (for pressure) and {round(temperature_samples.mean(), 2)} (for temperature)')
print(f'The degrees of freedom is {degrees_of_freedom}')
print(f'The t-statistic is: {t_statistic}')
print(f'The p-value is: {p_value}')
print(f'The critical value is: {critical_value}')

The sample means are 0.41 (for pressure) and 0.6 (for temperature)
The degrees of freedom is 181
The t-statistic is: 4.62339530051101
The p-value is: 6.802338430131697e-06
The critical value is: 1.9731570421553688


Since the p-value is less than our significance level of 0.05, we reject the null hypothesis in favor of the alternative hypothesis, which states that the sample means are statistically different to a significant extent.

### Confidence Interval

In [7]:
margin_of_error = critical_value * np.sqrt(A + B)
mean_difference = abs(pressure_samples.mean() - temperature_samples.mean())
print(f'The difference between sample means is: {round(mean_difference, 2)}')
print(f'Confidence Interval: ({round(mean_difference - margin_of_error, 2)}, {round(mean_difference + margin_of_error, 2)})')

The difference between sample means is: 0.2
Confidence Interval: (0.14, 0.25)


Based on the above confidence interval, we can assume with 95% confidence that the actual difference between population means is between 9% and 20% of their normalized values.