# Activity: Explore hypothesis testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, then use the results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For your analysis, you'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
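
As a rough illustration of what `equal_var=False` changes, the sketch below computes Welch's t-statistic and its Welch-Satterthwaite degrees of freedom by hand and compares them with `scipy.stats.ttest_ind()`. The two arrays are hypothetical values chosen for illustration, not rows from the lab dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI-like samples (illustrative only, not from c4_epa_air_quality.csv)
group_a = np.array([12.0, 9.0, 15.0, 11.0, 14.0, 10.0])
group_b = np.array([7.0, 8.0, 6.0, 9.0, 5.0, 8.0, 7.0, 6.0])

# Per-group variance of the sample mean (sample variance divided by sample size)
var_a = group_a.var(ddof=1) / len(group_a)
var_b = group_b.var(ddof=1) / len(group_b)

# Welch's t-statistic: difference in means over the unpooled standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (var_a + var_b) ** 2 / (
    var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

# The same test through SciPy
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)

print("t-statistics match:", np.isclose(t_manual, t_scipy))
print("p-values match:", np.isclose(p_manual, p_scipy))
```

Because the standard error is not pooled, the test stays valid when one group is noisier than the other, which is plausible when comparing a single county with the rest of a state.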

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`. To conduct your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In [1]:
# Import data manipulation libraries
import pandas as pd
import numpy as np

# Import statistical functions
from scipy import stats

# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Ensure plots display in notebook environments
%matplotlib inline

You are also provided with a dataset with national Air Quality Index (AQI) measurements by state over time for this analysis. `Pandas` was used to import the file `c4_epa_air_quality.csv` as a dataframe named `aqi`. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

In [2]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE ###
aqi = pd.read_csv('c4_epa_air_quality.csv')

## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.

In [3]:
# Display the first few rows of the dataset
print(aqi.head())

# Get general information about the dataset: column names, types, non-null counts
# (info() prints its summary directly and returns None, so it is not wrapped in print())
aqi.info()

# Get summary statistics for numeric columns (like AQI and arithmetic mean)
print(aqi.describe())

# Check for missing values in each column
print(aqi.isnull().sum())

# Check the number of unique states, counties, and cities
print("Unique states:", aqi['state_name'].nunique())
print("Unique counties:", aqi['county_name'].nunique())
print("Unique cities:", aqi['city_name'].nunique())

# Explore which states are represented and how many rows per state
print(aqi['state_name'].value_counts())

# Check how many rows are available for California and Los Angeles County
print("Rows for California:", aqi[aqi['state_name'] == 'California'].shape[0])
print("Rows for Los Angeles County in California:", aqi[(aqi['state_name'] == 'California') & (aqi['county_name'] == 'Los Angeles')].shape[0])

   Unnamed: 0  date_local    state_name   county_name      city_name  \
0           0  2018-01-01       Arizona      Maricopa        Buckeye   
1           1  2018-01-01          Ohio       Belmont      Shadyside   
2           2  2018-01-01       Wyoming         Teton  Not in a city   
3           3  2018-01-01  Pennsylvania  Philadelphia   Philadelphia   
4           4  2018-01-01          Iowa          Polk     Des Moines   

                                     local_site_name   parameter_name  \
0                                            BUCKEYE  Carbon monoxide   
1                                          Shadyside  Carbon monoxide   
2  Yellowstone National Park - Old Faithful Snow ...  Carbon monoxide   
3                             North East Waste (NEW)  Carbon monoxide   
4                                          CARPENTER  Carbon monoxide   

    units_of_measure  arithmetic_mean  aqi  
0  Parts per million         0.473684    7  
1  Parts per million         0.263158 

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referring to the material on descriptive statistics.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider using `pandas` or `numpy` to explore the `aqi` dataframe.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

Any of the following functions may be useful:
- `pandas`: `describe()`, `value_counts()`, `head()`, and the `shape` attribute
- `numpy`: `unique()`,`mean()`
    
</details>

#### **Question 1: From the preceding data exploration, what do you recognize?**

Data Sufficiency for Hypothesis Tests:

The dataset includes 260 AQI observations across 52 distinct `state_name` values, with California having the highest count (66 rows).

Los Angeles County within California has 14 rows, which should be sufficient for a two-sample comparison against the other 52 California rows, although larger samples would improve power.

New York (10 rows) and Ohio (12 rows) both have modest sample sizes, making a two-sample comparison feasible, especially using Welch’s t-test, which does not assume equal variances.

Michigan has 9 observations, which is borderline but still usable for a one-sample t-test of whether its mean AQI is greater than 10.

Completeness:

There are no missing values in the columns critical to this analysis (`state_name`, `county_name`, `aqi`, `arithmetic_mean`), which supports reliable hypothesis testing.

The `local_site_name` column has a small amount of missing data (3 missing values), but that won't affect the AQI analysis.

Data Characteristics:

The average AQI across all records is approximately 6.76, with a standard deviation of about 7.06.

AQI values range from 0 to 50, covering a meaningful spectrum of air quality situations.

Variable Understanding:

The column `aqi` is the variable of interest for hypothesis testing, representing air quality. The column `arithmetic_mean` refers to pollutant concentrations, which may be related but is not the direct target of the policy-related questions.

## Step 3. Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
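
The five steps above can be sketched in code as follows. This is a minimal illustration that assumes a two-sided one-sample test; the reference value `mu_0` and the `sample` list are hypothetical and are not taken from the AQI dataset.

```python
from scipy import stats

# Step 1: formulate the hypotheses
#   H0: the population mean equals mu_0
#   HA: the population mean differs from mu_0
mu_0 = 5.0  # hypothetical reference value
sample = [4.1, 5.6, 6.2, 5.9, 4.8, 6.5, 5.3, 6.1]  # hypothetical measurements

# Step 2: set the significance level
alpha = 0.05

# Step 3: determine the test procedure (here, a one-sample t-test)
# Step 4: compute the p-value
result = stats.ttest_1samp(sample, popmean=mu_0)

# Step 5: draw the conclusion by comparing the p-value with alpha
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}: {decision}")
```

The tests later in this lab follow the same pattern; only the hypotheses, the test function, and the `alternative` argument change.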

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [4]:
# Subset for Los Angeles County in California
la_df = aqi[(aqi['state_name'] == 'California') & (aqi['county_name'] == 'Los Angeles')]

# Subset for the rest of California (excluding Los Angeles County)
rest_of_ca_df = aqi[(aqi['state_name'] == 'California') & (aqi['county_name'] != 'Los Angeles')]

# Check the sizes of each group
print(f"Los Angeles County rows: {len(la_df)}")
print(f"Rest of California rows: {len(rest_of_ca_df)}")

Los Angeles County rows: 14
Rest of California rows: 52


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on subsetting dataframes.  
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider creating two dataframes, one for Los Angeles, and one for all other California observations.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

For your first dataframe, filter to `county_name` of `Los Angeles`. For your second dataframe, filter to `state_name` of `California` and `county_name` not equal to `Los Angeles`.
    
</details>

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [5]:
# Extract the AQI values
la_aqi = la_df['aqi']
rest_ca_aqi = rest_of_ca_df['aqi']

# Set the significance level
alpha = 0.05

# Perform Welch’s t-test
t_stat, p_value = stats.ttest_ind(la_aqi, rest_ca_aqi, equal_var=False)

# Print the results
print("T-statistic:", t_stat)
print("P-value:", p_value)

# Draw conclusion
if p_value < alpha:
    print("Reject the null hypothesis. There is a statistically significant difference in mean AQI.")
else:
    print("Fail to reject the null hypothesis. There is no statistically significant difference in mean AQI.")

T-statistic: 2.1107010796372014
P-value: 0.049839056842410995
Reject the null hypothesis. There is a statistically significant difference in mean AQI.


#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a **two-sample  𝑡-test**.

#### Compute the P-value

In [6]:
# Compute the p-value using Welch’s t-test
t_stat, p_value = stats.ttest_ind(la_df['aqi'], rest_of_ca_df['aqi'], equal_var=False)

# Output the test statistic and p-value
print("T-statistic:", t_stat)
print("P-value:", p_value)

T-statistic: 2.1107010796372014
P-value: 0.049839056842410995


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on how to perform a two-sample t-test.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  In `ttest_ind()`, `a` is the `aqi` column from our "Los Angeles" dataframe, and `b` is the `aqi` column from the "Other California" dataframe.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

  Be sure to set `equal_var=False`.

</details>

#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

The computed p-value is 0.04984.

Interpretation:

Since the p-value is less than the significance level of 0.05, we reject the null hypothesis. Note that the result is borderline: the p-value is only marginally below the threshold.

This indicates that there is a statistically significant difference in the mean AQI between Los Angeles County and the rest of California, and the positive t-statistic (2.11) shows that the Los Angeles sample mean is the higher of the two. Therefore, a metropolitan-focused approach may be justified, as Los Angeles appears to have a different air quality profile compared to other counties in the state.
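
Each test in this lab is meant to be paired with a visualization. A minimal sketch of one option for hypothesis 1 is a side-by-side boxplot with the group means marked. The two arrays below are hypothetical stand-ins; in the notebook you would pass `la_df['aqi']` and `rest_of_ca_df['aqi']` instead.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical AQI values standing in for la_df['aqi'] and rest_of_ca_df['aqi']
la_values = np.array([14, 9, 22, 11, 17, 8, 25, 13, 10, 19, 7, 16, 12, 21])
rest_values = np.array([5, 9, 3, 12, 7, 6, 15, 4, 8, 10, 2, 11, 6, 9, 5, 13])

fig, ax = plt.subplots(figsize=(6, 4))

# One box per group, drawn at x positions 1 and 2
ax.boxplot([la_values, rest_values])
ax.set_xticklabels(["Los Angeles County", "Rest of California"])
ax.set_ylabel("AQI")
ax.set_title("AQI by region (illustrative data)")

# Mark each group's mean, since the t-test compares means rather than medians
group_means = [la_values.mean(), rest_values.mean()]
ax.scatter([1, 2], group_means, marker="D", color="black", zorder=3, label="Group mean")
ax.legend()
```

In a notebook with `%matplotlib inline`, the figure renders when the cell finishes; when running as a script, call `plt.show()` afterwards.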

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [7]:
# Subset for New York
ny_df = aqi[aqi['state_name'] == 'New York']

# Subset for Ohio
ohio_df = aqi[aqi['state_name'] == 'Ohio']

# Check the sizes of each group
print(f"New York rows: {len(ny_df)}")
print(f"Ohio rows: {len(ohio_df)}")

New York rows: 10
Ohio rows: 12


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the materials on subsetting dataframes.  
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider creating two dataframes, one for New York, and one for Ohio observations.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

For your first dataframe, filter to `state_name` of `New York`. For your second dataframe, filter to `state_name` of `Ohio`.
    
</details>

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a **one-sided (left-tailed) two-sample  𝑡-test**.
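
To see how the direction of the test affects the p-value, the sketch below runs the same Welch's t-test with each setting of the `alternative` argument of `scipy.stats.ttest_ind()`. The two samples are hypothetical and only serve to illustrate the relationship.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (illustrative only); sample_x has the lower mean
sample_x = np.array([4.0, 6.0, 5.0, 3.0, 7.0, 5.0, 4.0])
sample_y = np.array([8.0, 6.0, 9.0, 7.0, 10.0, 6.0, 8.0, 9.0])

# Two-sided test: H_A is "the means differ"
t_stat, p_two_sided = stats.ttest_ind(sample_x, sample_y, equal_var=False)

# Left-tailed test: H_A is "mean of sample_x is less than mean of sample_y"
_, p_less = stats.ttest_ind(sample_x, sample_y, equal_var=False, alternative='less')

# Right-tailed test: H_A is "mean of sample_x is greater than mean of sample_y"
_, p_greater = stats.ttest_ind(sample_x, sample_y, equal_var=False, alternative='greater')

print(f"t = {t_stat:.3f}")
print(f"two-sided p = {p_two_sided:.4f}")
print(f"left-tailed p = {p_less:.4f}, right-tailed p = {p_greater:.4f}")
```

When the t-statistic is negative, the left-tailed p-value is half the two-sided one, and the left-tailed and right-tailed p-values always sum to 1. Choosing the wrong direction can therefore flip the conclusion.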

#### Compute the P-value

In [8]:
# Extract the AQI values for New York and Ohio
ny_aqi = ny_df['aqi']
ohio_aqi = ohio_df['aqi']

# Set the significance level
alpha = 0.05

# Perform the two-sample t-test
t_stat, p_value = stats.ttest_ind(ny_aqi, ohio_aqi, equal_var=False, alternative='less')

# Print the results
print("T-statistic:", t_stat)
print("P-value:", p_value)

# Draw conclusion
if p_value < alpha:
    print("Reject the null hypothesis. New York has a statistically lower AQI than Ohio.")
else:
    print("Fail to reject the null hypothesis. There is insufficient evidence that New York has a lower mean AQI than Ohio.")

T-statistic: -2.025951038880333
P-value: 0.030446502691934697
Reject the null hypothesis. New York has a statistically lower AQI than Ohio.


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on how to perform a two-sample t-test.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  In `ttest_ind()`, `a` is the `aqi` column from the "New York" dataframe, and `b` is the `aqi` column from the "Ohio" dataframe.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

  You can assign `tstat`, `pvalue` to the output of `ttest_ind`. Be sure to include `alternative='less'` as part of your code.  

</details>

#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

P-value: 0.0304

This p-value indicates that the difference in mean AQI between New York and Ohio is statistically significant at the 5% significance level. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that New York has a lower mean AQI than Ohio.

###  Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [9]:
# Subset for Michigan
michigan_df = aqi[aqi['state_name'] == 'Michigan']

# Check the number of rows for Michigan
print(f"Michigan rows: {len(michigan_df)}")

# Extract the AQI values for Michigan
michigan_aqi = michigan_df['aqi']

# Check the descriptive statistics for Michigan's AQI
print(michigan_aqi.describe())

Michigan rows: 9
count     9.000000
mean      8.111111
std       3.257470
min       2.000000
25%       7.000000
50%       8.000000
75%      10.000000
max      13.000000
Name: aqi, dtype: float64


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on subsetting dataframes.  
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider creating one dataframe which only includes Michigan.
</details>

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a **one-sample  𝑡-test**. 
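
As a worked check, the one-sample t-statistic can be reproduced by hand from the summary statistics printed above for Michigan (n = 9, mean 8.111111, standard deviation 3.257470). The sketch below also shows how the two-sided and one-sided p-values relate; the variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics copied from the describe() output for Michigan
n = 9
sample_mean = 8.111111
sample_std = 3.257470
mu_0 = 10  # policy threshold

# t = (sample mean - hypothesized mean) / standard error of the mean
standard_error = sample_std / math.sqrt(n)
t_stat = (sample_mean - mu_0) / standard_error
df = n - 1

# Two-sided p-value (H_A: mean != 10)
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)

# One-sided p-value (H_A: mean > 10): area to the right of t_stat
p_greater = stats.t.sf(t_stat, df)

print(f"t = {t_stat:.4f}")
print(f"two-sided p = {p_two_sided:.4f}")
print(f"one-sided p (greater) = {p_greater:.4f}")
```

Because the sample mean is below 10, the t-statistic is negative (about -1.74) and the one-sided p-value for "greater than 10" equals 1 minus half of the two-sided p-value, which makes it much larger than 0.05.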

#### Compute the P-value

In [10]:
# Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

# Subset for Michigan
michigan_df = aqi[aqi['state_name'] == 'Michigan']

# Extract the AQI values for Michigan
michigan_aqi = michigan_df['aqi']

# Set the significance level
alpha = 0.05

# Perform a one-sided, one-sample t-test (H_A: the mean AQI of Michigan is greater than 10)
t_stat, p_value = stats.ttest_1samp(michigan_aqi, 10, alternative='greater')

# Output the test statistic and p-value, rounded to four decimal places
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

T-statistic: -1.7396
P-value: 0.9399


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on how to perform a one-sample t-test.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  In `ttest_1samp()`, you are comparing the `aqi` column from your Michigan data relative to 10, the new policy threshold.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

  You can assign `tstat`, `pvalue` to the output of `ttest_1samp`. Be sure to include `alternative='greater'` as part of your code.  

</details>

#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

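The one-sided P-value for hypothesis 3 is approximately 0.94.

Interpretation:
The alternative hypothesis is one-sided ($H_A$: the mean AQI is greater than 10), so the relevant P-value is the one obtained from `ttest_1samp` with `alternative='greater'`. Michigan's sample mean (8.11) is below 10, which gives a negative t-statistic (-1.74); the one-sided P-value is therefore 1 - 0.1201/2 ≈ 0.94, where 0.1201 is the two-sided P-value.

Since the P-value (≈ 0.94) is greater than the significance level (α = 0.05), we fail to reject the null hypothesis.

This indicates that there is no statistically significant evidence to support the claim that the mean AQI in Michigan is greater than 10.

Thus, based on this test, Michigan is unlikely to be affected by the new policy.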

## Step 4. Results and Evaluation

Now that you've completed your statistical tests, you can consider your hypotheses and the results you gathered.

#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

Yes, the results of the hypothesis test showed that the AQI in Los Angeles County is statistically different from the rest of California. The p-value for the two-sample t-test was 0.0498, which is less than the significance level of 0.05. Therefore, we rejected the null hypothesis and concluded that there is a significant difference in the mean AQI between Los Angeles County and the rest of California.

#### **Question 6. Did New York or Ohio have a lower AQI?**

Based on the results of the hypothesis test, New York had a statistically lower AQI than Ohio. The p-value for the one-tailed t-test was 0.0304, which is less than the significance level of 0.05. Thus, we rejected the null hypothesis and concluded that New York has a statistically lower AQI than Ohio.

#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**



No, based on the one-sample t-test, Michigan is unlikely to be affected by the new policy. The test is one-sided ($H_A$: the mean AQI is greater than 10), and because Michigan's sample mean (8.11) is below 10, the one-sided p-value is about 0.94 (the two-sided value is 0.1201), far greater than the significance level of 0.05. Therefore, we failed to reject the null hypothesis: the data provide no significant evidence that Michigan's mean AQI exceeds 10.

# Conclusion

**What are key takeaways from this lab?**

Statistical Testing for Decision-Making: This lab reinforced how hypothesis testing, particularly t-tests, can be used to analyze data and make informed decisions. By comparing AQI values across different counties and states, we were able to determine if observed differences were statistically significant or if they could be attributed to random variation.

Focus on Regions with Significant Differences: The analysis of AQI across various locations, including Los Angeles County vs. the rest of California, and New York vs. Ohio, highlighted regions with significantly different air quality, which is critical for policy implementation and resource allocation.

Importance of P-values: The use of p-values allowed us to quantitatively assess whether the differences in AQI between the regions were statistically significant. For instance, a p-value below 0.05 indicated a significant difference, which would justify further actions or changes, while a higher p-value meant the data did not provide sufficient evidence of a difference, which is not the same as showing that no difference exists.

Real-World Implications: The results have direct implications for decisions like selecting new regional offices based on air quality considerations, and evaluating the impact of new policies that are based on environmental thresholds.

**What would you consider presenting to your manager as part of your findings?**

Los Angeles vs. Rest of California: The mean AQI in Los Angeles County is statistically different from the rest of California (p ≈ 0.0498), and the positive t-statistic indicates that the Los Angeles mean is the higher of the two. This supports the metropolitan-focused approach ROA is considering, although the p-value is only marginally below 0.05, so the finding should be presented as borderline.

New York vs. Ohio: New York has a statistically lower AQI compared to Ohio. If environmental quality is a deciding factor for your next regional office, New York may be a better choice in terms of air quality.

Michigan’s Policy Impact: The new policy targeting states with a mean AQI of 10 or greater may not affect Michigan. The statistical test did not find evidence of Michigan’s AQI being significantly higher than 10, so Michigan may not need to make immediate adjustments based on this policy.

Data-driven Recommendations: Based on the findings, it would be advisable to focus efforts on areas with significantly high AQI, like Los Angeles, or areas with significantly lower AQI, like New York, in your resource planning and policy implementation.

**What would you convey to external stakeholders?**

Environmental Quality and Its Importance: Share that air quality significantly varies by region, and decisions about regional offices, policies, or investments should consider environmental factors. This is crucial for both employee well-being and long-term operational success.

Implications of New Policies: For external stakeholders, including policymakers and regulatory bodies, the results can inform discussions on environmental standards and regulations. For example, the policy targeting states with an AQI of 10 or greater may need to consider exceptions or different implementation timelines for regions like Michigan.

Transparency in Data: It’s important to communicate that the decisions are based on robust statistical analyses, which helps in creating trust with external stakeholders who rely on accurate and transparent data when making decisions about the environment and public health.

Further Investigations: While the analysis has provided insights into AQI differences, further research may be needed in specific regions with borderline AQI values, such as Michigan, to refine future policies or business strategies.