In [2]:
# Run this cell
%load_ext jupyter_ai_magics
from datascience import *
import numpy as np
import math
%matplotlib inline
import matplotlib.pyplot as plt

# CMPSC 5A Final: AQI

In this project, you will apply the concepts you have learned in the class so far. These include table manipulation (all table functions), iteration (for loops), conditionals (if statements), data cleaning, and hypothesis testing.

## Names
Please list all students who worked on this project.
1. Student Name 1
2. Student Name 2

## Logistics

**Deadline:** The final project notebook is due Tuesday, June 4th, 2024 at 11:59pm PT. The final project presentation slides are due by 11:00am PT on Wednesday, June 5th. Unlike labs, **no late submissions are allowed**.

**Submission:** For full credit, you must complete all the questions and submit to Gradescope. You may still change your answers before the project deadline - only your final submission will be graded for correctness. Only one partner needs to submit the notebook to Gradescope, and they will need to add the other two group members as members on Gradescope. See [How to Add Group Member in Gradescope](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members). **After they submit, all group members should open their Gradescope accounts and see that a submission has been processed. Be sure that the final notebook has all the ChatGPT prompts used by all the team members**.

**Presentation:** Your group will need to create a presentation slide deck and give a 7-8 minute oral presentation during your assigned time slot. Presentations must use slides. Your group is not allowed to scroll through your notebook during the presentation. All slides must be uploaded to https://drive.google.com/drive/u/1/folders/1RCSdk8tVY8CR52ZVt68pbkEw9QZs77zs

**IMPORTANT NOTES:** 
- You are not limited to just one solution code cell, one prompt cell and one workflow cell for each question. Use as many of each as you like to ensure your notebook is presented well, easy-to-read, and has all the required plots and intermediate tables visible to show how you deduced that answer.
- None of the questions are created in a way that will allow you to just give a one line answer. Remember, if your answer is just a one line answer, you are probably missing something.
- Every group's answers may differ depending on the approach taken to the data analysis. Other groups may have presented a result with a different graph than yours, or filtered the table in a different way. That does not mean yours is wrong. We are looking for diversity in how information is displayed, and there is more than one correct answer to each question.

**Partners:** You will work with two other partners (total three in a group); your partners can be from any lab section. 

**Rules:** Don't share your code with anybody but your partners. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem. Since the problems are open-ended, they can have various different answers. What is important is the approach you take to solve your task.

**Support:** You are not alone! Come to office hours, post on Ed, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private Ed post and the staff will respond. If you're ever feeling overwhelmed or don't know how to make progress, email your TA or ULA.

**Advice:** Develop your answers incrementally. To perform a complicated table manipulation, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells. Make sure that you are using distinct and meaningful variable names throughout the notebook. Along that line, **TRY NOT TO** reuse the variable names.

You **never** have to use just one line in this project or any others. Use intermediate variables and multiple lines as much as you would like!

# AQI Dataset

The Air Quality Index (AQI) is a standardized indicator used to communicate how polluted the air currently is or how polluted it is forecasted to become. Public health risks increase as the AQI rises, and the index is used by government agencies to determine when to issue health advisories or restrictions on industrial activities.

The dataset you will use is AQI by County. This dataset is compiled by the Environmental Protection Agency (EPA). This kind of data could be used for a variety of purposes, including public health analysis, environmental policy-making, and academic research into the effects of air quality on population health. It is also a valuable tool for informing the public about day-to-day variations in air quality and for issuing warnings on days when the air quality is particularly poor. The data contains entries for the years 2017 - 2023.

- `State`: The state in the United States where the measurement was taken.
- `County`: The specific county within the state where the air quality data was recorded.
- `Year`: The year for which the data is relevant.
- `Days with AQI`: The number of days in the year when the AQI was recorded.
- `Good Days`: The number of days where the AQI indicated that the air quality was good.
- `Moderate Days`: The number of days where the AQI indicated a moderate health concern.
- `Unhealthy for Sensitive Groups Days`: Days when the AQI suggested health effects for sensitive groups.
- `Unhealthy Days`: The number of days where the AQI indicated health effects for everyone.
- `Very Unhealthy Days`: Days where the AQI suggested health alert conditions.
- `Hazardous Days`: The number of days indicating hazardous air quality.
- `Max AQI`: The highest AQI recorded in the county for the year.
- `90th Percentile AQI`: The AQI level below which 90% of the AQI values fall.
- `Median AQI`: The median AQI value for the year.
- `Days CO`: The number of days when carbon monoxide was the predominant pollutant.
- `Days NO2`: The number of days when nitrogen dioxide was the predominant pollutant.
- `Days Ozone`: The count of days when ozone was the main pollutant.
- `Days PM2.5`: The number of days when fine particulate matter (PM2.5) was the main pollutant.
- `Days PM10`: The number of days when coarse particulate matter (PM10) was the main pollutant.

Please refer to [Chapter 12.1: A/B Testing](https://inferentialthinking.com/chapters/12/1/AB_Testing.html) when completing this final project as many of the questions use concepts covered there!

In [3]:
# Run this cell
aqi = Table.read_table("data/annual_aqi_2017-2023.csv")
aqi

State,County,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
Alabama,Baldwin,2023,170,143,27,0,0,0,0,90,54,40,0,0,84,86,0
Alabama,Clay,2023,155,109,46,0,0,0,0,83,61,40,0,0,0,155,0
Alabama,DeKalb,2023,212,155,55,2,0,0,0,133,63,43,0,0,141,71,0
Alabama,Elmore,2023,118,102,16,0,0,0,0,90,54,40,0,0,118,0,0
Alabama,Etowah,2023,181,126,55,0,0,0,0,100,64,43,0,0,74,107,0
Alabama,Jefferson,2023,182,72,98,8,3,1,0,230,91,54,1,0,63,118,0
Alabama,Madison,2023,181,129,50,2,0,0,0,115,68,43,0,0,86,95,0
Alabama,Mobile,2023,178,133,45,0,0,0,0,90,59,43,0,0,68,110,0
Alabama,Montgomery,2023,150,97,53,0,0,0,0,93,71,47,0,0,66,84,0
Alabama,Morgan,2023,181,138,41,2,0,0,0,140,64,41,0,0,95,86,0
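
A quick data-cleaning check can be sketched from the preview above. It rests on an assumption: in the displayed rows, the six category columns (`Good Days` through `Hazardous Days`) add up to `Days with AQI`. The column descriptions do not promise this for every row of the full file, so treat it as something to verify rather than a guarantee. The arrays below are copied by hand from three of the displayed rows, and the variable names are illustrative only.

```python
import numpy as np

# Three rows copied from the table preview above (Alabama, 2023).
counties = np.array(["Baldwin", "DeKalb", "Jefferson"])
days_with_aqi = np.array([170, 212, 182])

# Columns, in order: Good, Moderate, Unhealthy for Sensitive Groups,
# Unhealthy, Very Unhealthy, Hazardous.
category_days = np.array([
    [143, 27, 0, 0, 0, 0],
    [155, 55, 2, 0, 0, 0],
    [72, 98, 8, 3, 1, 0],
])

# Sum the six categories for each county (one total per row).
category_totals = category_days.sum(axis=1)

# True where the categories account for every recorded day.
rows_consistent = category_totals == days_with_aqi

for county, ok in zip(counties, rows_consistent):
    print(county, "consistent" if ok else "needs a closer look")
```

Rows that fail a check like this are worth inspecting before they feed into any plot or hypothesis test.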


**Question 1:** Can you plot a multi-series line graph to compare the trend of `Good Days`, `Moderate Days`, and `Days with AQI` over the years for **a particular county of your choice**? This could help identify whether AQI ratings follow a pattern from year to year.

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_

**Question 2:** **Select a single year** and create a stacked bar graph for any 3 counties within a single state where the number of `Unhealthy for Sensitive Groups Days` > 4. Show the composition of `Good Days`, `Moderate Days`, `Unhealthy for Sensitive Groups Days`, `Unhealthy Days`, `Very Unhealthy Days`, and `Hazardous Days`. This will visually indicate the overall air quality profile for each county. **Do this for any 3 states**.

NOTE: Your final answer should have 9 individual stacked bars: 3 states x 3 counties. You can either have them condensed into less than 9 graphs or show 9 separate graphs.

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_

**Question 3:** What is the probability, based on the data for the year 2023, that the County of Santa Barbara in California will have a good day? What about a moderate day? Visualize the probabilities and also create a supplementary graph that shows the `Median AQI` over time in Santa Barbara.
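
A minimal sketch of one way to estimate such a probability, assuming it is taken as the fraction of recorded days that fell in a category (for example `Good Days` divided by `Days with AQI`). The example uses the Baldwin County, Alabama row from the table preview rather than Santa Barbara, and `estimate_probability` is a hypothetical helper name, not part of any library.

```python
def estimate_probability(category_days, days_with_aqi):
    """Estimate a probability as the share of recorded days in a category."""
    if days_with_aqi <= 0:
        raise ValueError("days_with_aqi must be positive")
    return category_days / days_with_aqi

# Baldwin County, Alabama, 2023 (values from the table preview).
baldwin_days_with_aqi = 170
baldwin_good_days = 143
baldwin_moderate_days = 27

p_good = estimate_probability(baldwin_good_days, baldwin_days_with_aqi)
p_moderate = estimate_probability(baldwin_moderate_days, baldwin_days_with_aqi)

print(f"P(good day)     = {p_good:.3f}")
print(f"P(moderate day) = {p_moderate:.3f}")
```

Dividing by `Days with AQI` rather than 365 matters here, because many counties did not record an AQI on every day of the year.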

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_

**Question 4:** Let's investigate AQI trends before and after the COVID-19 pandemic. First, define a table that contains one row per year with the sum of the good days for that year. Do not include 2020, which we will consider the pandemic year for the purposes of this problem. It should contain two columns:
- `Label`: Denotes whether a year is a `Pre-Pandemic` year (2017-2019) or a `Post-Pandemic` year (2021-2023).
- `Average Good Days`: The average of the total good days across all counties for a particular year.

Then, construct an overlaid histogram of two observed distributions: the average good days in pre-pandemic years and the average good days in post-pandemic years.
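
The labeling rule described above can be sketched as a small function. This is only one possible way to write it; `pandemic_label` is a hypothetical name, and returning `None` for 2020 is an assumption about how to mark the excluded year.

```python
def pandemic_label(year):
    """Return the period label for a year, or None for the excluded year 2020."""
    if 2017 <= year <= 2019:
        return "Pre-Pandemic"
    if 2021 <= year <= 2023:
        return "Post-Pandemic"
    return None  # 2020 (or any year outside the dataset) gets no label

years = [2017, 2018, 2019, 2020, 2021, 2022, 2023]

# Keep only the years that receive a label.
labels = {}
for year in years:
    label = pandemic_label(year)
    if label is not None:
        labels[year] = label

print(labels)
```

With a `datascience` table, a function like this could be applied to the `Year` column with `Table.apply`, after first filtering out the 2020 rows.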

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_

**Question 5:** You are helping out a researcher conduct an in-progress A/B test about the differences in Days Ozone in California and Texas. Create a table which has the `Days Ozone` data for each California and Texas county.

The researcher is interested in analyzing the following observed test statistic:
$$\text{average days ozone across California counties} - \text{average days ozone across Texas counties}$$

**State one possible set of null and alternate hypotheses, and write a function** to simulate the test statistic under the null hypothesis once and return the value of the test statistic for that simulated sample. Then, use a loop to run your function 5000 times and store each sample test statistic in an array. Create a visualization to display the distribution of sample test statistics.
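
The simulation step can be sketched as follows, assuming the null hypothesis is simulated by shuffling the state labels (a permutation test, as in the A/B testing chapter). The `Days Ozone` values here are made up for illustration and are not taken from the dataset; the function names are hypothetical, and the sketch uses plain NumPy arrays rather than a `datascience` table.

```python
import numpy as np

def difference_of_means(values, labels):
    """Mean of the California values minus mean of the Texas values."""
    return values[labels == "California"].mean() - values[labels == "Texas"].mean()

def simulate_one_statistic(values, labels, rng):
    """Shuffle the state labels once and recompute the test statistic."""
    shuffled_labels = rng.permutation(labels)
    return difference_of_means(values, shuffled_labels)

# Made-up Days Ozone values for illustration only (not from the dataset).
days_ozone = np.array([210, 180, 250, 300, 150, 120, 90, 200])
states = np.array(["California"] * 4 + ["Texas"] * 4)

rng = np.random.default_rng(0)  # seeded so the run is repeatable
observed = difference_of_means(days_ozone, states)

# Repeat the one-shot simulation and collect every statistic.
repetitions = 5000
simulated = np.empty(repetitions)
for i in range(repetitions):
    simulated[i] = simulate_one_statistic(days_ozone, states, rng)

print("observed statistic:", observed)
print("mean of simulated statistics:", simulated.mean())
```

Shuffling the labels keeps the group sizes fixed while breaking any link between state and ozone days, which is what the null hypothesis claims.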


In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_

**Question 6:** 
Using the observed test statistic and the array of sample test statistics, compute the p-value for the hypothesis test. Remember, we introduce p-values and how to compute them in [Ch 11.3 Decisions and Uncertainty](https://inferentialthinking.com/chapters/11/3/Decisions_and_Uncertainty.html). State a conclusion from this test using a p-value cutoff of 5%.
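
A minimal sketch of an empirical p-value, assuming a one-sided alternative in which larger values of the statistic favor the alternative. The direction of the comparison depends on the hypotheses you stated, so `>=` may need to become `<=` or an absolute-value comparison. The simulated statistics below are made up for illustration, and `empirical_p_value` is a hypothetical helper name.

```python
import numpy as np

def empirical_p_value(simulated_statistics, observed_statistic):
    """Proportion of simulated statistics at least as large as the observed one."""
    simulated_statistics = np.asarray(simulated_statistics)
    at_least_as_large = np.count_nonzero(simulated_statistics >= observed_statistic)
    return at_least_as_large / len(simulated_statistics)

# Made-up simulated statistics for illustration only.
toy_simulated = np.array([-12.0, -5.0, -1.0, 0.0, 2.0, 3.0, 7.0, 9.0, 15.0, 20.0])
toy_observed = 9.0

p_value = empirical_p_value(toy_simulated, toy_observed)
print("p-value:", p_value)

# Compare against the 5% cutoff.
cutoff = 0.05
if p_value < cutoff:
    print("The data are inconsistent with the null hypothesis at the 5% cutoff.")
else:
    print("The data are consistent with the null hypothesis at the 5% cutoff.")
```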

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_

**Question 7:**  Create a table that has one row for each state in 2023 with its `Median AQI`. Divide the states into West Coast states and East Coast states based on the [U.S. Map](https://upload.wikimedia.org/wikipedia/commons/a/a4/Map_of_USA_with_state_and_territory_names_2.png). You don't have to use all the states. Let's begin the process of conducting an A/B test investigating the differences in average `Median AQI` between West Coast and East Coast States. Please define a null and alternate hypothesis, create appropriate visuals to display and compare the two distributions, and compute your test statistic. What does your test statistic mean in context?

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_

**Question 8:** Continue the A/B test from Question 7. Simulate the test statistic under the null hypothesis and visualize the results. Compute the p-value for the test and state the conclusion of the hypothesis test.


In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_

**Question 9 (Novel Analysis Part 1):**

Design your own A/B test! Choose one variable whose trends you would like to investigate, and create a new table with that variable's data for the years and states you would like to analyze. Remember to divide the overall table into two groups that you want to compare for the test, and conduct an A/B test investigating the differences between the two groups.

Let's begin the process.

a) Define null and alternate hypotheses.

b) Create a new table with state grouping and the average Median AQI for the states you chose.

c) Create appropriate visuals to display and compare the two distributions.

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_

**Question 10 (Novel Analysis Part 2):**

Continue the A/B Test from Question 9!

a) State your test statistic value and what it means in context.

b) Simulate the test statistic under the null hypothesis and visualize the results.

c) Compute the p-value for the test and state the conclusion of the hypothesis test.

In [None]:
#SOLUTION

#### Enter prompt below:

In [None]:
%%ai openai-chat:gpt-3.5-turbo

#### Explain your answer below:

_type your answer here_