In [4]:
# Run this cell once when you start working on the notebook.
# On Google Colab, wait for the Google Drive permission prompts before proceeding
import matplotlib.pyplot as plt
%matplotlib inline
import os
import sys
try:
    %load_ext jupyter_ai_magics
except Exception:
    # The extension is optional; without it, %%ai cells are unavailable.
    print("%%ai cells will not work in this notebook")
    print("Please use Gemini for AI queries instead")
from datascience import *
import numpy as np
import math

DATA_FILENAME="data/annual_aqi_2017-2023.csv"
try:
  from google.colab import drive
  drive.mount('/content/drive', force_remount=True)
  !mkdir -p /content/data
  !gdown --fuzzy https://drive.google.com/file/d/1wsoAjxtyQN5ty4oB5LO845YqJTHEXBps/view?usp=drive_link -O /content/data/annual_aqi_2017-2023.csv
  DATA_FILENAME = DATA_FILENAME.replace("data/", "/content/data/")
  !ls -l /content/data
except Exception:
  # google.colab is only importable on Colab, so this branch runs elsewhere.
  print("Google Drive not mounted; this is normal on Jupyter Hub")


%%ai cells will not work in this notebook
Please use Gemini for AI queries instead
Mounted at /content/drive
Downloading...
From: https://drive.google.com/uc?id=1wsoAjxtyQN5ty4oB5LO845YqJTHEXBps
To: /content/data/annual_aqi_2017-2023.csv
100% 452k/452k [00:00<00:00, 109MB/s]
total 444
-rw-r--r-- 1 root root 451828 May 20 19:31 annual_aqi_2017-2023.csv


# CS5A S25 Final Project: AirQuality

* Please refer to the general instructions in [this document](https://docs.google.com/document/d/1qgS-GPKsbcbqNq8bbDk8kiB0bH-HRZcfN4La75bOtWU/edit?usp=sharing) before starting.
* You may work on either JupyterHub or Google Colab


If working on Colab:

* The Google Colab version is [this folder](https://drive.google.com/drive/folders/1wsVszAufxxmSR7_uoOoR6QBQmjlurPNx?usp=drive_link)
* Make a copy of the notebook in your group folder for the final team project before starting to make edits.
* If/when working in the same file, be sure to coordinate with your group so that only one member is editing at a time; Colab doesn't handle simultaneous editing very well.






## Names

Please list all students who were members of this team.

1. Student Name 1
2. Student Name 2
3. Student Name 3
4. Student Name 4

## Member Responsibilities



*Write your team member responsibility distribution here* (See [instructions](https://docs.google.com/document/d/1qgS-GPKsbcbqNq8bbDk8kiB0bH-HRZcfN4La75bOtWU/edit?tab=t.0#bookmark=id.igbdergm85kj))



# CMPSC 5A Final — AQI

In this project, you will apply the concepts you have learned in the class so far. These include table manipulation (all table functions), iteration (for loops), conditionals (if statements), data cleaning, and hypothesis testing.

The Air Quality Index (AQI) is a standardized indicator used to communicate how polluted the air currently is or how polluted it is forecasted to become. Public health risks increase as the AQI rises, and the index is used by government agencies to determine when to issue health advisories or restrictions on industrial activities.

The dataset you will use is AQI by County, compiled by the Environmental Protection Agency (EPA). This kind of data can be used for a variety of purposes, including public health analysis, environmental policy-making, and academic research into the effects of air quality on population health. It is also a valuable tool for informing the public about day-to-day variations in air quality and for issuing warnings on days when the air quality is particularly poor. The data contains entries for the years 2017 through 2023, with the following columns:

- `State`: The state in the United States where the measurement was taken.
- `County`: The specific county within the state where the air quality data was recorded.
- `Year`: The year for which the data is relevant.
- `Days with AQI`: The number of days in the year when the AQI was recorded.
- `Good Days`: The number of days where the AQI indicated that the air quality was good.
- `Moderate Days`: The number of days where the AQI indicated a moderate health concern.
- `Unhealthy for Sensitive Groups Days`: Days when the AQI suggested health effects for sensitive groups.
- `Unhealthy Days`: The number of days where the AQI indicated health effects for everyone.
- `Very Unhealthy Days`: Days where the AQI suggested health alert conditions.
- `Hazardous Days`: The number of days indicating hazardous air quality.
- `Max AQI`: The highest AQI recorded in the county for the year.
- `90th Percentile AQI`: The AQI level below which 90% of the AQI values fall.
- `Median AQI`: The median AQI value for the year.
- `Days CO`: The number of days when carbon monoxide was the predominant pollutant.
- `Days NO2`: The number of days when nitrogen dioxide was the predominant pollutant.
- `Days Ozone`: The count of days when ozone was the main pollutant.
- `Days PM2.5`: The number of days when fine particulate matter (PM2.5) was the main pollutant.
- `Days PM10`: The number of days when coarse particulate matter (PM10) was the main pollutant.
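
As a quick sanity check on these definitions, the six category columns would be expected to add up to `Days with AQI`, and so would the five predominant-pollutant columns. The sketch below tests that on a single row of the dataset (Jefferson County, Alabama, 2023) copied into a plain dictionary. The helper name `check_row` is hypothetical, and the identity itself is an assumption about how the table is built, so it is worth verifying on the full table before relying on it.

```python
# One row of the dataset (Jefferson County, Alabama, 2023) as a plain dict.
row = {
    "Days with AQI": 182,
    "Good Days": 72,
    "Moderate Days": 98,
    "Unhealthy for Sensitive Groups Days": 8,
    "Unhealthy Days": 3,
    "Very Unhealthy Days": 1,
    "Hazardous Days": 0,
    "Days CO": 1,
    "Days NO2": 0,
    "Days Ozone": 63,
    "Days PM2.5": 118,
    "Days PM10": 0,
}

CATEGORY_COLUMNS = [
    "Good Days",
    "Moderate Days",
    "Unhealthy for Sensitive Groups Days",
    "Unhealthy Days",
    "Very Unhealthy Days",
    "Hazardous Days",
]
POLLUTANT_COLUMNS = ["Days CO", "Days NO2", "Days Ozone", "Days PM2.5", "Days PM10"]


def check_row(row):
    """Return (category_total, pollutant_total) for one row (hypothetical helper)."""
    category_total = sum(row[c] for c in CATEGORY_COLUMNS)
    pollutant_total = sum(row[c] for c in POLLUTANT_COLUMNS)
    return category_total, pollutant_total


category_total, pollutant_total = check_row(row)
print(category_total, pollutant_total, row["Days with AQI"])
```

If either total differs from `Days with AQI` for some row in the full table, that row may need attention during data cleaning.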

Please refer to [Chapter 12.1: A/B Testing](https://inferentialthinking.com/chapters/12/1/AB_Testing.html) when completing this final project as many of the questions use concepts covered there!

In [6]:
# Run this cell
aqi = Table.read_table(DATA_FILENAME)
aqi

State,County,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
Alabama,Baldwin,2023,170,143,27,0,0,0,0,90,54,40,0,0,84,86,0
Alabama,Clay,2023,155,109,46,0,0,0,0,83,61,40,0,0,0,155,0
Alabama,DeKalb,2023,212,155,55,2,0,0,0,133,63,43,0,0,141,71,0
Alabama,Elmore,2023,118,102,16,0,0,0,0,90,54,40,0,0,118,0,0
Alabama,Etowah,2023,181,126,55,0,0,0,0,100,64,43,0,0,74,107,0
Alabama,Jefferson,2023,182,72,98,8,3,1,0,230,91,54,1,0,63,118,0
Alabama,Madison,2023,181,129,50,2,0,0,0,115,68,43,0,0,86,95,0
Alabama,Mobile,2023,178,133,45,0,0,0,0,90,59,43,0,0,68,110,0
Alabama,Montgomery,2023,150,97,53,0,0,0,0,93,71,47,0,0,66,84,0
Alabama,Morgan,2023,181,138,41,2,0,0,0,140,64,41,0,0,95,86,0


**Question 1:** Plot a multi-series line graph comparing the trends of `Good Days`, `Moderate Days`, and `Days with AQI` over the years for **a particular county of your choice**. Because the data is annual, this can reveal a year-over-year trend in AQI ratings, not a within-year seasonal pattern.

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

_type your answer here_

**Question 2:** **Select a single year** and create a graph for any 3 counties within a single state where the number of `Unhealthy for Sensitive Groups Days` is greater than 4. Show the composition of `Good Days`, `Moderate Days`, `Unhealthy for Sensitive Groups Days`, `Unhealthy Days`, `Very Unhealthy Days`, and `Hazardous Days`. This will visually indicate the overall air quality profile for each county. **Do this three times, once for each of three different states**.
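
One way to read "composition" is as each category's share of the county's categorized days. The sketch below applies the greater-than-4 filter and computes those shares on three Alabama rows for 2023, copied from the dataset into plain dictionaries. The helper name `composition` is hypothetical, and plotting raw day counts instead of shares may be an equally valid reading of the question.

```python
CATEGORIES = [
    "Good Days",
    "Moderate Days",
    "Unhealthy for Sensitive Groups Days",
    "Unhealthy Days",
    "Very Unhealthy Days",
    "Hazardous Days",
]


def make_row(county, counts):
    """Build a row dict from a county name and six category counts."""
    row = {"County": county}
    row.update(zip(CATEGORIES, counts))
    return row


# Three Alabama rows for 2023, in the category order listed above.
rows = [
    make_row("DeKalb", [155, 55, 2, 0, 0, 0]),
    make_row("Jefferson", [72, 98, 8, 3, 1, 0]),
    make_row("Madison", [129, 50, 2, 0, 0, 0]),
]


def composition(row):
    """Share of each category among the row's categorized days (hypothetical helper)."""
    total = sum(row[c] for c in CATEGORIES)
    return {c: row[c] / total for c in CATEGORIES}


# Keep only counties with more than 4 "Unhealthy for Sensitive Groups" days.
qualifying = [r for r in rows if r["Unhealthy for Sensitive Groups Days"] > 4]

for r in qualifying:
    shares = composition(r)
    print(r["County"], {c: round(s, 3) for c, s in shares.items()})
```

Only one of these three sample rows passes the filter, which is why the question asks you to search within a state until three qualifying counties are found.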

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

_type your answer here_

**Question 3:** Based on the data for 2023, what is the probability that Santa Barbara County in California will have a good day? What about a moderate day? Visualize the probabilities, and also create a supplementary graph that shows the `Median AQI` over time in Santa Barbara.
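
A minimal sketch of the arithmetic, assuming the probability is estimated as the fraction of recorded days that fall in a category. It uses the Baldwin County, Alabama row for 2023 from the dataset, not Santa Barbara, and the function name `empirical_probability` is hypothetical.

```python
def empirical_probability(category_days, days_with_aqi):
    """Estimate P(category) as category days divided by days with an AQI reading."""
    if days_with_aqi <= 0:
        raise ValueError("days_with_aqi must be positive")
    return category_days / days_with_aqi


# Baldwin County, Alabama, 2023: 170 days with AQI, 143 good, 27 moderate.
p_good = empirical_probability(143, 170)
p_moderate = empirical_probability(27, 170)

print(round(p_good, 3), round(p_moderate, 3))
```

Note that the denominator is `Days with AQI`, not 365, since the estimate can only be based on days that were actually measured.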

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

_type your answer here_

**Question 4:** Let's investigate AQI trends by state. First, define a table that contains the sum of the `Median AQI` values for each state, and categorize each state as North or South. The categorizations are given to you below, so drop any rows whose states appear in neither list. The table should contain two columns:
- `Label`: Denotes whether a state is North or South.
- `Total Median AQI`: The sum of the `Median AQI` values for each state.

Then, construct an overlaid histogram of two observed distributions: the total median AQI in Northern States and the total median AQI in Southern States.
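
The labeling and summing steps can be sketched in plain Python as follows. The two short state sets and the `(state, median AQI)` records are made up for illustration and stand in for the full lists and the real table, and the helper name `label_for` is hypothetical.

```python
from collections import defaultdict

# Hypothetical short sets standing in for the full North and South lists.
north = {"Oregon", "Maine"}
south = {"Texas", "Alabama"}

# Made-up (state, median AQI) records, one per county-year (not real data).
records = [
    ("Oregon", 30),
    ("Oregon", 35),
    ("Maine", 28),
    ("Texas", 45),
    ("Texas", 50),
    ("Alabama", 40),
    ("Puerto Rico", 33),  # in neither set, so it is dropped
]


def label_for(state):
    """Return 'North', 'South', or None if the state is in neither set."""
    if state in north:
        return "North"
    if state in south:
        return "South"
    return None


# Sum the median AQI per state, skipping states without a label.
totals = defaultdict(int)
for state, median_aqi in records:
    if label_for(state) is not None:
        totals[state] += median_aqi

# One (Label, Total Median AQI) pair per state, ordered by state name.
table = [(label_for(state), total) for state, total in sorted(totals.items())]
print(table)
```

In the notebook itself, the same idea would typically be expressed with table operations (filtering, grouping with a sum, and adding a label column) instead of explicit loops.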

In [None]:
# Run This cell
north_states = [
    'Washington', 'Oregon', 'Idaho', 'Montana', 'Wyoming', 'North Dakota', 'South Dakota',
    'Minnesota', 'Wisconsin', 'Michigan', 'New York', 'Maine', 'Vermont', 'New Hampshire',
    'Massachusetts', 'Rhode Island', 'Connecticut', 'Pennsylvania', 'New Jersey', 'Ohio',
    'Indiana', 'Illinois', 'Iowa', 'Nebraska', 'Colorado', 'Utah', 'Nevada', 'California', 'Alaska'
]

south_states = [
    'Arizona', 'New Mexico', 'Texas', 'Oklahoma', 'Kansas', 'Missouri', 'Arkansas', 'Louisiana',
    'Mississippi', 'Alabama', 'Georgia', 'Florida', 'South Carolina', 'North Carolina', 'Tennessee',
    'Kentucky', 'Virginia', 'West Virginia', 'Maryland', 'Delaware', 'Hawaii'
]

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

_type your answer here_

**Question 5:** You are helping a researcher conduct an in-progress A/B test about the difference in `Days Ozone` between California and Texas. Create a table that has the `Days Ozone` data for each California and Texas county.

The researcher is interested in analyzing the following observed test statistic:
$$\text{average days ozone across California counties} - \text{average days ozone across Texas counties}$$

**State one possible set of null and alternative hypotheses, and write a function** to simulate the test statistic under the null hypothesis once and return the value of the test statistic for that simulated sample. Then, use a loop to run your function 5000 times and store each sample test statistic in an array. Create a visualization to display the distribution of sample test statistics.
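
A minimal sketch of one simulation under the null hypothesis, assuming the null says the state label has no effect on `Days Ozone`, so the labels can be shuffled among the counties. The values are made up, the function names are hypothetical, and the sketch uses only the standard library in place of the table-based sampling covered in Chapter 12.1.

```python
import random
from statistics import mean


def difference_of_means(values, labels, group_a, group_b):
    """Mean of the values labeled group_a minus mean of those labeled group_b."""
    a = [v for v, lab in zip(values, labels) if lab == group_a]
    b = [v for v, lab in zip(values, labels) if lab == group_b]
    return mean(a) - mean(b)


def simulate_one_statistic(values, labels, group_a, group_b, rng):
    """Shuffle a copy of the labels once and recompute the statistic."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return difference_of_means(values, shuffled, group_a, group_b)


# Made-up Days Ozone counts for five counties per state (not real data).
values = [210, 180, 250, 300, 150, 120, 90, 160, 200, 110]
labels = ["California"] * 5 + ["Texas"] * 5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
observed = difference_of_means(values, labels, "California", "Texas")

# Repeat the simulation 5000 times, collecting each simulated statistic.
simulated = []
for _ in range(5000):
    simulated.append(
        simulate_one_statistic(values, labels, "California", "Texas", rng)
    )

print("observed:", observed, "simulations:", len(simulated))
```

Shuffling a copy of the labels leaves the original pairing intact, so the observed statistic can still be computed from the same data afterwards.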


In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

_type your answer here_

**Question 6:**
Using the observed test statistic and the array of sample test statistics, compute the p-value for the hypothesis test. Remember, we introduce p-values and how to compute them in [Ch 11.3 Decisions and Uncertainty](https://inferentialthinking.com/chapters/11/3/Decisions_and_Uncertainty.html). State a conclusion from this test using a p-value cutoff of 5%.
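
The p-value computation can be sketched as the fraction of simulated statistics at least as extreme as the observed one, in the direction of the alternative hypothesis. The numbers below are made up, and the function name `empirical_p_value` and its `alternative` parameter are hypothetical.

```python
def empirical_p_value(simulated, observed, alternative="two-sided"):
    """Fraction of simulated statistics at least as extreme as the observed one."""
    if alternative == "greater":
        count = sum(1 for s in simulated if s >= observed)
    elif alternative == "less":
        count = sum(1 for s in simulated if s <= observed)
    elif alternative == "two-sided":
        count = sum(1 for s in simulated if abs(s) >= abs(observed))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return count / len(simulated)


# Made-up simulated statistics and observed value (not real results).
simulated = [-3, -1, 0, 2, 4, 5, -6, 1, 0, 3]
observed = 4

p_greater = empirical_p_value(simulated, observed, "greater")
p_two_sided = empirical_p_value(simulated, observed, "two-sided")

cutoff = 0.05
print(p_greater, p_two_sided, p_greater < cutoff)
```

With a 5% cutoff, a p-value below 0.05 leads to rejecting the null hypothesis; otherwise the data are consistent with the null. Which direction counts as "extreme" should follow from the alternative hypothesis you stated.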

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

_type your answer here_

**Question 7:** Create a table that has one row for each state in 2023 with its `Median AQI`. Divide the states into West Coast states and East Coast states based on the [U.S. Map](https://upload.wikimedia.org/wikipedia/commons/a/a4/Map_of_USA_with_state_and_territory_names_2.png). You don't have to use all the states. Let's begin the process of conducting an A/B test investigating the difference in average `Median AQI` between West Coast and East Coast states. Please define null and alternative hypotheses, create appropriate visuals to display and compare the two distributions, and compute your test statistic. What does your test statistic mean in context?

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

_type your answer here_

**Question 8:** Continue the A/B test from Question 7. Simulate the test statistic under the null hypothesis and visualize the results. Compute the p-value for the test and state the conclusion of the hypothesis test.


In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

_type your answer here_

**Question 9 (Novel Analysis Part 1):**

Design your own A/B test! Choose one variable whose trends you would like to investigate, and create a new table with that variable's data for the years and states you would like to analyze. Remember to divide the overall table into two groups that you want to compare for the test, and conduct an A/B test investigating the differences between the two groups.

Let's begin the process.

a) Define null and alternative hypotheses.

b) Plot the observed distribution of the variable you will be investigating.

c) State your test statistic value and what it means in the context of the problem.

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-4

#### Explain your answer below:

_type your answer here_

**Question 10 (Novel Analysis Part 2):**

Continue the A/B test from Question 9!


a) Simulate the test statistic under the null hypothesis and visualize the results.

b) Select a significance level for your test and state the conclusion of the hypothesis test.

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_