# Mini Project 5-5 Explore Hypothesis Testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, using your results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For your analysis, you'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
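To make the second note concrete, here is a minimal sketch of what `equal_var=False` changes. The two arrays are hypothetical values, not taken from the lab dataset, and the manual calculation is only an illustration of the unpooled standard error and the Welch-Satterthwaite degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI-like samples with different spreads (not from the lab dataset)
group_a = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 11.0])
group_b = np.array([5.0, 6.0, 4.0, 7.0, 5.0, 6.0, 5.0, 4.0])

# Welch's t-test as SciPy computes it
welch = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# The same statistic by hand: difference in means over the unpooled standard error
var_a = group_a.var(ddof=1) / group_a.size
var_b = group_b.var(ddof=1) / group_b.size
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite degrees of freedom (usually not an integer)
df_manual = (var_a + var_b) ** 2 / (
    var_a ** 2 / (group_a.size - 1) + var_b ** 2 / (group_b.size - 1)
)

# Two-sided p-value from the t-distribution with those degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

print("SciPy: ", welch.statistic, welch.pvalue)
print("Manual:", t_manual, p_manual)
```

The two printed lines should agree, which shows that Welch's test differs from the pooled test only in the standard error and the degrees of freedom.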

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`. To conduct your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In [39]:
# Import relevant packages

import pandas as pd
import numpy as np
from scipy import stats


You are also provided with a dataset with national Air Quality Index (AQI) measurements by state over time for this analysis. `Pandas` was used to import the file `c4_epa_air_quality.csv` as a dataframe named `aqi`. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

In [40]:
# IMPORT YOUR DATA
aqi = pd.read_csv("c4_epa_air_quality.csv")
aqi = aqi.dropna()



## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.

In [41]:
# Use head() to show a sample of data
aqi.head()

| | Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |


In [42]:
# Check the number of rows and columns
print(aqi.shape)

(257, 10)


In [43]:
# Use describe() to summarize AQI

aqi.describe()

| | Unnamed: 0 | arithmetic_mean | aqi |
| --- | --- | --- | --- |
| count | 257.0 | 257.0 | 257.0 |
| mean | 129.766537 | 0.404578 | 6.782101 |
| std | 74.675286 | 0.319311 | 7.091422 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 66.0 | 0.2 | 2.0 |
| 50% | 130.0 | 0.278947 | 5.0 |
| 75% | 194.0 | 0.516667 | 9.0 |
| max | 259.0 | 1.921053 | 50.0 |


In [44]:
# Count unique rows with value_counts() (for counts by state, call it on aqi['state_name'] instead)
aqi.value_counts()

Unnamed: 0  date_local  state_name      county_name  city_name                                  local_site_name                 parameter_name   units_of_measure   arithmetic_mean  aqi
0           2018-01-01  Arizona         Maricopa     Buckeye                                    BUCKEYE                         Carbon monoxide  Parts per million  0.473684         7      1
131         2018-01-01  Arizona         Maricopa     Phoenix                                    CENTRAL PHOENIX                 Carbon monoxide  Parts per million  1.110526         27     1
165         2018-01-01  Utah            Weber        Ogden                                      Ogden                           Carbon monoxide  Parts per million  0.326316         7      1
166         2018-01-01  New Jersey      Hudson       Jersey City                                Jersey City                     Carbon monoxide  Parts per million  0.133333         3      1
167         2018-01-01  New York        New York     Ne
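Because the research questions compare specific states, the per-state sample sizes matter more than counts of unique rows. The sketch below shows one way to get them; the small `sample` dataframe is a hypothetical stand-in for `aqi`, and the same two calls should work on the real dataframe.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `aqi` dataframe
sample = pd.DataFrame({
    "state_name": ["California", "California", "Ohio", "New York", "Ohio", "California"],
    "aqi": [11, 16, 5, 3, 6, 13],
})

# Number of observations per state, largest first
state_counts = sample["state_name"].value_counts()
print(state_counts)

# Per-state sample size and mean AQI in one table
state_summary = sample.groupby("state_name")["aqi"].agg(["count", "mean"])
print(state_summary)
```

Small per-state counts are a reason to prefer the t-distribution over a normal approximation in the tests that follow.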

#### **Question 1: From the preceding data exploration, what do you recognize?**

A:

## Step 3. Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.
2. Set the significance level.
3. Determine the appropriate test procedure.
4. Compute the p-value.
5. Draw your conclusion.
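The five steps above can be sketched as one small function. `two_sample_decision` is a hypothetical helper written for illustration, not part of the lab or of SciPy, and the input lists are made-up values.

```python
from scipy import stats

def two_sample_decision(sample_a, sample_b, alpha=0.05):
    """Hypothetical helper: two-sided Welch's t-test plus a decision string."""
    # Steps 1-3: H0 is "the means are equal", alpha is fixed before looking
    # at the data, and the procedure is Welch's two-sample t-test.
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    # Step 4: compute the p-value
    pvalue = result.pvalue
    # Step 5: draw the conclusion
    decision = "Reject H0" if pvalue < alpha else "Fail to reject H0"
    return pvalue, decision

# Nearly identical samples: no evidence of a difference in means
p_similar, decision_similar = two_sample_decision([10, 12, 11, 13, 12], [10, 11, 12, 13, 11])
print(p_similar, decision_similar)

# Clearly separated samples: strong evidence of a difference in means
p_far, decision_far = two_sample_decision([1, 2, 3, 2, 1], [20, 21, 22, 21, 20])
print(p_far, decision_far)
```

Fixing `alpha` as a default argument keeps the significance level from being adjusted after the p-value is known.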

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [45]:
# Create dataframes for each sample being compared in your test

california_aqi = aqi[aqi['state_name'] == 'California']

la_county_aqi = california_aqi[california_aqi['county_name'] == 'Los Angeles']

rest_of_CA = california_aqi[california_aqi['county_name'] != 'Los Angeles']

print(la_county_aqi)
print(rest_of_CA)

     Unnamed: 0  date_local  state_name  county_name         city_name  \
33           33  2018-01-01  California  Los Angeles         Lancaster   
42           42  2018-01-01  California  Los Angeles     Santa Clarita   
61           61  2018-01-01  California  Los Angeles          Pasadena   
76           76  2018-01-01  California  Los Angeles       Los Angeles   
109         109  2018-01-01  California  Los Angeles       Los Angeles   
110         110  2018-01-01  California  Los Angeles       Los Angeles   
119         119  2018-01-01  California  Los Angeles            Reseda   
132         132  2018-01-01  California  Los Angeles           Compton   
163         163  2018-01-01  California  Los Angeles             Azusa   
172         172  2018-01-01  California  Los Angeles       Pico Rivera   
177         177  2018-01-01  California  Los Angeles        Long Beach   
189         189  2018-01-01  California  Los Angeles            Pomona   
233         233  2018-01-01  Californi

In [46]:
# Check head
la_county_aqi.head()
rest_of_CA.head()

| | Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | 16 | 2018-01-01 | California | San Bernardino | Ontario | Ontario Near Road (Etiwanda) | Carbon monoxide | Parts per million | 0.747368 | 11 |
| 18 | 18 | 2018-01-01 | California | Sacramento | Arden-Arcade | Sacramento-Del Paso Manor | Carbon monoxide | Parts per million | 0.752632 | 16 |
| 26 | 26 | 2018-01-01 | California | Orange | La Habra | La Habra | Carbon monoxide | Parts per million | 0.673684 | 13 |
| 27 | 27 | 2018-01-01 | California | Alameda | Not in a city | Berkeley- Aquatic Park | Carbon monoxide | Parts per million | 1.088889 | 15 |
| 34 | 34 | 2018-01-01 | California | Fresno | Fresno | Fresno - Garland | Carbon monoxide | Parts per million | 1.0 | 15 |


#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [47]:
# For this analysis, the significance level is 5%

# The samples are small (n < 30), so use the t-distribution.
# Welch's t-test (equal_var=False) does not assume the two groups share a variance.

stats.ttest_ind(a=la_county_aqi['aqi'], b=rest_of_CA['aqi'], equal_var=False)

TtestResult(statistic=2.1107010796372014, pvalue=0.049839056842410995, df=17.08246830361151)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a **two-sample  𝑡-test**.

#### Compute the P-value

In [48]:
# Compute your p-value here

# Extract the p-value and compare it with the significance level

statistic, pvalue = stats.ttest_ind(a=la_county_aqi['aqi'], b=rest_of_CA['aqi'], equal_var=False)

print("pvalue:", pvalue)
print()

if pvalue < 0.05:
    print('pvalue < 0.05, Reject Ho.')          # Evidence of a difference in mean AQI
else:
    print('pvalue > 0.05, Fail to reject Ho.')  # No evidence of a difference in mean AQI

pvalue: 0.049839056842410995

pvalue < 0.05, Reject Ho.


#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

In [49]:
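# Extract the p-value and run the test


A: The p-value is approximately 0.0498, which is just below the 5% significance level, so we reject the null hypothesis. The data suggest that the mean AQI in Los Angeles County differs from the mean AQI in the rest of California, although the result is borderline.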

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [50]:
# Create dataframes for each sample being compared in your test

New_York = aqi[aqi['state_name'] == 'New York']
Ohio = aqi[aqi['state_name'] == 'Ohio']




print(New_York)
print(Ohio)

     Unnamed: 0  date_local state_name county_name      city_name  \
90           90  2018-01-01   New York        Erie    Cheektowaga   
113         113  2018-01-01   New York       Bronx       New York   
124         124  2018-01-01   New York      Monroe      Rochester   
167         167  2018-01-01   New York    New York       New York   
173         173  2018-01-01   New York      Queens       New York   
182         182  2018-01-01   New York      Queens       New York   
184         184  2018-01-01   New York     Steuben  Not in a city   
195         195  2018-01-01   New York        Erie        Buffalo   
196         196  2018-01-01   New York      Monroe      Rochester   
234         234  2018-01-01   New York      Albany         Albany   

              local_site_name   parameter_name   units_of_measure  \
90          Buffalo Near-Road  Carbon monoxide  Parts per million   
113           PFIZER LAB SITE  Carbon monoxide  Parts per million   
124               ROCHESTER 2  Ca

In [51]:
# Check head

New_York.head()
Ohio.head()

| | Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 12 | 12 | 2018-01-01 | Ohio | Hamilton | Cincinnati | Taft NCore | Carbon monoxide | Parts per million | 0.252632 | 3 |
| 22 | 22 | 2018-01-01 | Ohio | Stark | Canton | Canton | Carbon monoxide | Parts per million | 0.394737 | 6 |
| 51 | 51 | 2018-01-01 | Ohio | Summit | Akron | NIHF STEM MS | Carbon monoxide | Parts per million | 0.083333 | 3 |
| 59 | 59 | 2018-01-01 | Ohio | Cuyahoga | Cleveland | GT Craig NCore | Carbon monoxide | Parts per million | 0.25 | 3 |


**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a **two-sample  𝑡-test**.
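A one-directional comparison can be run either by converting a two-sided p-value by hand or by stating the direction up front. The sketch below uses hypothetical samples and assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` parameter.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (not from the lab dataset)
sample_a = np.array([3.0, 2.0, 4.0, 3.0, 2.0, 3.0, 1.0])
sample_b = np.array([5.0, 3.0, 6.0, 4.0, 7.0, 5.0])

# Two-sided Welch test, then convert to a left-tailed p-value by hand
two_sided = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
if two_sided.statistic < 0:
    manual_left = two_sided.pvalue / 2
else:
    manual_left = 1 - two_sided.pvalue / 2

# Same test with the direction stated up front (H_A: mean of a < mean of b)
left_tailed = stats.ttest_ind(
    a=sample_a, b=sample_b, equal_var=False, alternative="less"
)

print("Manual left-tailed p-value:", manual_left)
print("SciPy left-tailed p-value: ", left_tailed.pvalue)
```

Both approaches should give the same p-value. Passing `alternative="less"` avoids the sign check and makes the direction of the alternative hypothesis explicit in the call.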

#### Compute the P-value

In [52]:
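# Compute your p-value here

# Welch's t-test (two-sided); the next cell converts the result to a left-tailed p-value.
# Note: this compares the `arithmetic_mean` column, whereas the hypothesis is stated in terms of `aqi`.
statistic, pvalue = stats.ttest_ind(a=New_York['arithmetic_mean'], 
                                    b=Ohio['arithmetic_mean'], 
                                    equal_var=False)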

#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

In [53]:
# Your code here.

if statistic < 0:
    adjusted_pvalue = pvalue / 2
else:
    adjusted_pvalue = 1 - (pvalue / 2)

# Extracting pvalue and making the test
print("T-statistic:", statistic)
print("Adjusted p-value (left-tailed):", adjusted_pvalue)
print()

if adjusted_pvalue < 0.05:
    
    print('pvalue < 0.05, Reject Ho.')          
else:
    print('pvalue > 0.05, Fail to reject Ho.')  

T-statistic: 0.29374585904584777
Adjusted p-value (left-tailed): 0.6136693659905865

pvalue > 0.05, Fail to reject Ho.


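A: The left-tailed p-value is approximately 0.614, which is well above 0.05, so we fail to reject the null hypothesis. The test as run does not provide evidence that New York's mean is below Ohio's. Note that the test above was run on the `arithmetic_mean` column rather than the `aqi` column, so it does not directly answer the question about AQI.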

###  Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [54]:
# Create dataframes for each sample being compared in your test

Michigan = aqi[aqi['state_name'] == 'Michigan']




print(Michigan)


     Unnamed: 0  date_local state_name county_name      city_name  \
65           65  2018-01-01   Michigan       Wayne        Livonia   
122         122  2018-01-01   Michigan       Wayne        Detroit   
123         123  2018-01-01   Michigan       Wayne        Detroit   
129         129  2018-01-01   Michigan       Wayne        Detroit   
192         192  2018-01-01   Michigan       Wayne     Allen Park   
207         207  2018-01-01   Michigan       Wayne  Not in a city   
226         226  2018-01-01   Michigan        Kent   Grand Rapids   
242         242  2018-01-01   Michigan       Wayne        Detroit   
248         248  2018-01-01   Michigan       Wayne        Detroit   

              local_site_name   parameter_name   units_of_measure  \
65                 LIVONIA-NR  Carbon monoxide  Parts per million   
122               West corner  Carbon monoxide  Parts per million   
123  MARK TWAIN MIDDLE SCHOOL  Carbon monoxide  Parts per million   
129                  ELIZA-NR  Ca

**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a **one-sample  𝑡-test**. 
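A right-tailed one-sample test can be sketched in the same two ways: converting the two-sided p-value by hand, or passing the direction to SciPy. The sample values and the threshold below are hypothetical, and the `alternative` parameter of `ttest_1samp` assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI-like readings for one state (not from the lab dataset)
sample = np.array([8.0, 12.0, 9.0, 7.0, 11.0, 6.0, 10.0, 9.0])
threshold = 10  # value the mean is compared against

# Two-sided test, then convert to a right-tailed p-value by hand
two_sided = stats.ttest_1samp(a=sample, popmean=threshold)
if two_sided.statistic > 0:
    manual_right = two_sided.pvalue / 2
else:
    manual_right = 1 - two_sided.pvalue / 2

# Same test with the direction stated up front (H_A: mean > threshold)
right_tailed = stats.ttest_1samp(a=sample, popmean=threshold, alternative="greater")

print("Manual right-tailed p-value:", manual_right)
print("SciPy right-tailed p-value: ", right_tailed.pvalue)
```

When the sample mean falls below the threshold, the t-statistic is negative and the right-tailed p-value is above 0.5, so a right-tailed test cannot reject the null hypothesis in that case.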

#### Compute the P-value

In [55]:
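# One-sample t-test against the policy threshold of 10.
# Note: this tests the `arithmetic_mean` column, whereas the hypothesis is stated in terms of `aqi`.
statistic, pvalue = stats.ttest_1samp(a=Michigan['arithmetic_mean'], popmean=10)

# Convert the two-sided p-value to a right-tailed one (H_A: mean > 10)
if statistic > 0:
    adjusted_pvalue = pvalue / 2
else:
    adjusted_pvalue = 1 - (pvalue / 2)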

print("T-statistic:", statistic)
print("Adjusted p-value (right-tailed):", adjusted_pvalue)
print()

if adjusted_pvalue < 0.05:
    print('pvalue < 0.05, Reject Ho.')          
else:
    print('pvalue > 0.05, Fail to reject Ho.')  


T-statistic: -162.15297021609814
Adjusted p-value (right-tailed): 0.9999999999999988

pvalue > 0.05, Fail to reject Ho.


#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

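A: The right-tailed p-value is approximately 1.0, far above 0.05, so we fail to reject the null hypothesis. There is no evidence that the mean AQI of Michigan is greater than 10. Failing to reject does not prove that the mean is less than or equal to 10, and the test above was run on the `arithmetic_mean` column rather than `aqi`.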

## Step 4. Results and Evaluation

Now that you've completed your statistical tests, you can consider your hypotheses and the results you gathered.

#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

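A: Yes. At the 5% significance level, the mean AQI in Los Angeles County was statistically different from the rest of California (p ≈ 0.0498), though only narrowly.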

#### **Question 6. Did New York or Ohio have a lower AQI?**

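A: The test as run did not show that New York has a lower mean than Ohio: the left-tailed p-value was approximately 0.614, so the null hypothesis was not rejected. Because that test compared the `arithmetic_mean` column rather than `aqi`, it does not settle the question for AQI.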

#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**
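
A: No, not based on this test. The right-tailed test failed to reject the null hypothesis, so there is no evidence that Michigan's mean AQI exceeds 10.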

# Conclusion

**What are key takeaways from this project?**

A: That t-tests are an efficient way to assess whether the data provide enough evidence to reject a null hypothesis. Depending on the question, you need to choose between a right-tailed, left-tailed, or two-tailed test.

**What would you consider presenting to your manager as part of your findings?**

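A: I would present the answer to each of the three questions together with its p-value and the 5% significance level used. The mean AQI in LA County differs from the rest of California (p ≈ 0.0498). The New York versus Ohio test as run did not reject the null hypothesis (left-tailed p ≈ 0.614). There is no evidence that Michigan's mean AQI exceeds 10. I would also discuss the implications of each result for ROA's decisions.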


**What would you convey to external readers?**

A: I would share only the final results and outcomes, to help the public understand why these policy changes are taking place.
