# Activity: Explore hypothesis testing

## Introduction

ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For this analysis, we'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, we'll use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
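To make the second note concrete, here is a minimal sketch of what Welch's t-test computes, checked against `scipy.stats.ttest_ind()` with `equal_var=False`. The samples are synthetic values made up for illustration (they are not the AQI data), and the helper name `welch_t` is hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic, made-up samples (NOT the AQI data) with different spreads and sizes.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=12.0, scale=6.0, size=14)
group_b = rng.normal(loc=10.0, scale=2.0, size=52)


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each sample mean: sample variance divided by sample size.
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t_stat, df


t_manual, df_manual = welch_t(group_a, group_b)
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)  # two-sided p-value

# The same test via SciPy; the statistic and p-value should agree.
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(t_manual, df_manual, p_manual)
print(result.statistic, result.pvalue)
```

Because the two variances are not pooled, the degrees of freedom are generally not an integer, which is why the test outputs later in this lab report fractional `df` values.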

## Step 1: Imports

In [1]:
# Import relevant packages
import pandas as pd
import numpy as np
from scipy import stats 

#### Load Dataset

In [2]:
# RUN THIS CELL TO IMPORT YOUR DATA.
aqi = pd.read_csv('c4_epa_air_quality.csv')

## Step 2: Data Exploration

In [3]:
# Explore your dataframe `aqi` here:
aqi.describe(include='all')

| | Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 260.0 | 260 | 260 | 260 | 260 | 257 | 260 | 260 | 260.0 | 260.0 |
| unique | | 1 | 52 | 149 | 190 | 253 | 1 | 1 | | |
| top | | 2018-01-01 | California | Los Angeles | Not in a city | Kapolei | Carbon monoxide | Parts per million | | |
| freq | | 260 | 66 | 14 | 21 | 2 | 260 | 260 | | |
| mean | 129.5 | | | | | | | | 0.403169 | 6.757692 |
| std | 75.199734 | | | | | | | | 0.317902 | 7.061707 |
| min | 0.0 | | | | | | | | 0.0 | 0.0 |
| 25% | 64.75 | | | | | | | | 0.2 | 2.0 |
| 50% | 129.5 | | | | | | | | 0.276315 | 5.0 |
| 75% | 194.25 | | | | | | | | 0.516009 | 9.0 |


In [4]:
aqi['state_name'].value_counts()

state_name
California              66
Arizona                 14
Ohio                    12
Florida                 12
Texas                   10
New York                10
Pennsylvania            10
Michigan                 9
Colorado                 9
Minnesota                7
New Jersey               6
Indiana                  5
North Carolina           4
Massachusetts            4
Maryland                 4
Oklahoma                 4
Virginia                 4
Nevada                   4
Connecticut              4
Kentucky                 3
Missouri                 3
Wyoming                  3
Iowa                     3
Hawaii                   3
Utah                     3
Vermont                  3
Illinois                 3
New Hampshire            2
District Of Columbia     2
New Mexico               2
Montana                  2
Oregon                   2
Alaska                   2
Georgia                  2
Washington               2
Idaho                    2
Nebraska         

Most columns have 260 non-null entries; `local_site_name` has 257. The `aqi` column has a mean of about 6.76 with a standard deviation of about 7.06. The worst (highest) `aqi` value found was 50. Ohio (12 rows) and New York (10 rows) are among the better-represented states, although both samples are small.

## Step 3: Statistical Tests

It's always good to have a guide while working. For that reason, I'll recall the steps to conduct a statistical test:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
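The five steps above can be sketched as a small helper for the two-sample case. This is an illustrative sketch only: the function name `run_welch_test`, its return format, and the sample values are made up for this example and are not part of the lab.

```python
import numpy as np
from scipy import stats


def run_welch_test(sample_a, sample_b, alternative="two-sided", significance_level=0.05):
    """Run Welch's two-sample t-test and report the decision.

    Step 1 (the hypotheses) is encoded by `alternative`, step 2 by
    `significance_level`, step 3 by the choice of Welch's t-test, and
    steps 4 and 5 are the p-value and the decision returned below.
    """
    result = stats.ttest_ind(
        a=sample_a, b=sample_b, equal_var=False, alternative=alternative
    )
    return {
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "reject_null": bool(result.pvalue < significance_level),
    }


# Made-up values for illustration (NOT the AQI data).
sample_a = np.array([20.0, 22.0, 19.0, 24.0, 21.0, 23.0])
sample_b = np.array([10.0, 11.0, 9.0, 12.0, 10.0, 11.0])

clearly_different = run_welch_test(sample_a, sample_b)
identical = run_welch_test(sample_a, sample_a)
print(clearly_different)
print(identical)
```

Returning the decision alongside the statistic and p-value keeps the conclusion tied to the significance level that was chosen before the test was run.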

### Hypothesis 1: Let's consider a metropolitan-focused approach. Within California, I want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

In [5]:
# Create a dataframe for Los Angeles County
aqi_LA = aqi[aqi['county_name']=='Los Angeles']
LA_mean = aqi_LA['aqi'].mean() 
print(f"Mean: {LA_mean}")
print(f"Shape: {aqi_LA.shape}")

Mean: 16.285714285714285
Shape: (14, 10)


In [6]:
# Create a dataframe for the rest of California (excluding Los Angeles County)
aqi_cal = aqi[(aqi['state_name']=='California') & (aqi['county_name']!='Los Angeles')]
cal_mean = aqi_cal['aqi'].mean()
print(f"Mean: {cal_mean}")
print(f"Shape: {aqi_cal.shape}")

Mean: 11.0
Shape: (52, 10)


#### Formulate hypotheses:

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.

#### Set the significance level:

In [7]:
significance_level = 0.05

#### Compute the P-value

In [8]:
stats.ttest_ind(a=aqi_LA['aqi'], b=aqi_cal['aqi'], equal_var=False)

TtestResult(statistic=2.1107010796372014, pvalue=0.049839056842410995, df=17.08246830361151)

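The p-value is approximately 0.0498, which is just below the chosen 5% significance level, so I reject the null hypothesis: the mean AQI in Los Angeles County differs from that of the rest of California. Because the result is borderline, it should be interpreted with caution, but a metropolitan strategy may make sense in this case.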
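The decision rule can be written out explicitly. The sketch below copies the p-value from the `TtestResult` output above; the variable names `reject_null` and `margin` are illustrative and do not appear in the lab.

```python
# p-value copied from the TtestResult output for Hypothesis 1.
p_value = 0.049839056842410995
significance_level = 0.05

# Reject the null hypothesis only if the p-value is below the significance level.
reject_null = p_value < significance_level

# How far below the threshold the p-value falls.
margin = significance_level - p_value

print(f"p-value: {p_value:.4f}")
print(f"Reject H0 at alpha = {significance_level}: {reject_null}")
print(f"Margin below alpha: {margin:.6f}")
```

The small margin shows how close this result is to the threshold: a stricter significance level such as 1% would lead to the opposite decision.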

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

In [9]:
# New York dataframe
ny_aqi = aqi[aqi['state_name']=='New York']
ny_mean = ny_aqi['aqi'].mean()

# Ohio dataframe
ohio_aqi = aqi[aqi['state_name']=='Ohio']
ohio_mean = ohio_aqi['aqi'].mean()

**Formulate hypotheses:**
*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.

Significance Level (remains at 5%)

#### Compute the P-value

In [10]:
# Compute your p-value
# In the alternative parameter we use 'less' because we want to know if NY is below Ohio. 
stats.ttest_ind(a=ny_aqi['aqi'], b=ohio_aqi['aqi'], alternative='less', equal_var=False)

TtestResult(statistic=-2.025951038880333, pvalue=0.030446502691934683, df=15.036745051598716)

The p-value is approximately 0.03, which is below the 5% significance level, so I reject the null hypothesis. The data support the conclusion that New York's mean AQI is lower than Ohio's.
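To see where the one-sided p-value comes from, the sketch below recomputes it from the test statistic and degrees of freedom reported in the output above, using the t-distribution in `scipy.stats`. The variable names are illustrative.

```python
from scipy import stats

# Statistic and degrees of freedom copied from the TtestResult output for Hypothesis 2.
t_stat = -2.025951038880333
df = 15.036745051598716

# alternative='less': probability of a statistic at least this far into the lower tail.
p_less = stats.t.cdf(t_stat, df)

# For comparison: the upper-tail and two-sided p-values for the same statistic.
p_greater = stats.t.sf(t_stat, df)
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)

print(p_less, p_greater, p_two_sided)
```

Because the statistic is negative, the `'less'` p-value is half the two-sided one. Had the statistic been positive, the `'less'` p-value would have been above 0.5.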

### Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

In [11]:
# Michigan dataframe
michigan_aqi = aqi[aqi['state_name']=='Michigan']

**Formulate hypotheses:**
*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.

Here, a single sample mean is compared against a fixed value in one direction, so the appropriate procedure is a **one-sample t-test**.

#### Compute the P-value

In [12]:
# Conduct a one-sample t-test
# Compute your p-value here
# In the alternative parameter we use 'greater' because we want to know if Michigan's mean AQI is greater than 10.
stats.ttest_1samp(michigan_aqi['aqi'], 10, alternative='greater')

TtestResult(statistic=-1.7395913343286131, pvalue=0.9399405193140109, df=8)

The p-value is approximately 0.94, which is well above the 5% significance level, so I fail to reject the null hypothesis. There is no evidence that Michigan's mean AQI is greater than 10.
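As a rough illustration of what `stats.ttest_1samp()` computes, the sketch below works out a one-sample t statistic and its upper-tail p-value by hand and compares them with SciPy. The nine sample values are made up for the example and are not Michigan's readings.

```python
import numpy as np
from scipy import stats

# Made-up values for illustration (NOT the Michigan data): nine observations.
sample = np.array([8.0, 5.0, 11.0, 7.0, 9.0, 6.0, 12.0, 4.0, 10.0])
mu_0 = 10  # threshold from the null hypothesis

n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu_0) / standard_error

# alternative='greater': upper-tail probability with n - 1 degrees of freedom.
p_manual = stats.t.sf(t_manual, df=n - 1)

result = stats.ttest_1samp(sample, mu_0, alternative='greater')
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

When the sample mean lies below the threshold, the statistic is negative and the `'greater'` p-value exceeds 0.5, which is the same pattern seen in the Michigan result.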

## Results and Evaluation

Now that the statistical tests are complete, let's summarize the results gathered.

* The test indicates that the mean AQI in Los Angeles County differs from that of the rest of California, and the sample mean is higher in Los Angeles County (about 16.3 versus 11.0). It would make sense for ROA to focus on this county, for example by investigating which factors drive the difference.

* The test supports the conclusion that New York has a lower mean AQI than Ohio, which suggests that opening the office in Ohio could be more beneficial, since ROA could then be more involved in the day-to-day activities of the state with the higher AQI.

* The test found no evidence that Michigan's mean AQI exceeds 10, so Michigan is unlikely to be affected by the new policy.