# Inferences about Two Means (Independent Samples)

We use normalized data to make inferences about two sample means (independent samples) using the appropriate statistical hypothesis tests. By using normalized data, we can have a more helpful test that shows us how to compare means when they are based on ranges of their respective values.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

file_path = '~/Documents/UNT/csce5310/houston-aqi-2010-2021.csv'
df = pd.read_csv(file_path)

In [2]:
pressure_compare = df[['avg_pressure', 'avg_wind', 'avg_temperature']].values
scaler = MinMaxScaler()
pressure_compare = scaler.fit_transform(pressure_compare)
print(f"Mean Pressure: {round(pressure_compare[:, 0].mean(), 2)}")
print(f"Mean Wind: {round(pressure_compare[:, 1].mean(), 2)}")
print(f"Mean Temperature: {round(pressure_compare[:, 2].mean(), 2)}")

Mean Pressure: 0.43
Mean Wind: 0.29
Mean Temperature: 0.62


In [4]:
n = 100

def get_random_sample(variable):
    return np.random.choice(pressure_compare[:, variable], size=n, replace=False)

## Pressure vs. Wind

### Requirements Check

The requirements are all satisified, as we assume the standard deviations of both samples are unknown, they are random and independently distributed random samples, and n is large (n=100).

### Hypothesis Test

We conduct a hypothesis test that compares the sample means for our pressure variable with the next two variables of wind and temperature. We use a significance level of 0.05 and calculate the degrees of freedom using Welch's statistic (i.e. $(A + B)^2 / ((A^2 / (n_1 - 1)) + (B^2 / (n_2 - 1)))$ where $A = s_1^2 / n_1$ and $B = s_2^2 / n_2$) 

In [5]:
alpha = 0.05
pressure_samples = get_random_sample(0)
wind_samples = get_random_sample(1)
pressure_var = np.var(pressure_samples, ddof=1)
wind_var = np.var(wind_samples, ddof=1)
A, B = pressure_var / n, wind_var / n
degrees_of_freedom = int(((A + B) ** 2) / (((A ** 2) / (n - 1)) + ((B ** 2) / (n - 1))))
critical_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)
t_statistic, p_value = stats.ttest_ind(pressure_samples, wind_samples)
print(f'The sample means are {round(pressure_samples.mean(), 2)} (for pressure) and {round(wind_samples.mean(), 2)} (for wind)')
print(f'The degrees of freedom is {degrees_of_freedom}')
print(f'The t-statistic is: {t_statistic}')
print(f'The p-value is: {p_value}')
print(f'The critical value is: {critical_value}')

The sample means are 0.41 (for pressure) and 0.26 (for wind)
The degrees of freedom is 196
The t-statistic is: 6.758255055951963
The p-value is: 1.52551127903624e-10
The critical value is: 1.9721412216594967


Since the p-value is less than our significance level of 0.05, we reject the null hypothesis in favor of the alternative hypothesis, which states that the sample means are statistically significantly different.

### Confidence Interval

In [6]:
margin_of_error = critical_value * np.sqrt(A + B)
mean_difference = pressure_samples.mean() - wind_samples.mean()
print(f'The difference between sample means is: {round(mean_difference, 2)}')
print(f'Confidence Interval: ({round(mean_difference - margin_of_error, 2)}, {round(mean_difference + margin_of_error, 2)})')

The difference between sample means is: 0.15
Confidence Interval: (0.11, 0.19)


Based on the above confidence interval, we can assume with 95% confidence that the actual difference between population means is between 11% and 19% of their normalized values.

## Pressure vs. Temperature

### Hypothesis Test

In [8]:
alpha = 0.05
pressure_samples = get_random_sample(0)
temperature_samples = get_random_sample(2)
pressure_var = np.var(pressure_samples, ddof=1)
temperature_var = np.var(temperature_samples, ddof=1)
A, B = pressure_var / n, temperature_var / n
degrees_of_freedom = int(((A + B) ** 2) / (((A ** 2) / (n - 1)) + ((B ** 2) / (n - 1))))
critical_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)
t_statistic, p_value = stats.ttest_ind(pressure_samples, wind_samples)
print(f'The sample means are {round(pressure_samples.mean(), 2)} (for pressure) and {round(temperature_samples.mean(), 2)} (for temperature)')
print(f'The degrees of freedom is {degrees_of_freedom}')
print(f'The t-statistic is: {t_statistic}')
print(f'The p-value is: {p_value}')
print(f'The critical value is: {critical_value}')

The sample means are 0.42 (for pressure) and 0.64 (for temperature)
The degrees of freedom is 196
The t-statistic is: 6.758068582477179
The p-value is: 1.527106798249981e-10
The critical value is: 1.9721412216594967


Since the p-value is less than our significance level of 0.05, we reject the null hypothesis in favor of the alternative hypothesis for this two-tailed test, which states that the sample means are statistically different to a significant extent.

### Confidence Interval

In [9]:
margin_of_error = critical_value * np.sqrt(A + B)
mean_difference = abs(pressure_samples.mean() - temperature_samples.mean())
print(f'The difference between sample means is: {round(mean_difference, 2)}')
print(f'Confidence Interval: ({round(mean_difference - margin_of_error, 2)}, {round(mean_difference + margin_of_error, 2)})')

The difference between sample means is: 0.22
Confidence Interval: (0.17, 0.27)


Based on the above confidence interval, we can assume with 95% confidence that the actual difference between population means is between 17% and 27% of their normalized values.