# Using descriptive statistics in air quality data

## **Introduction**

In this activity, I will work with data from the United States Environmental Protection Agency (EPA). I will analyze data on air quality with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names.

## **Step 1: Imports** 


In [1]:
# Import relevant Python libraries.
import pandas as pd
import numpy as np

In [2]:
# Read dataset
epa_data = pd.read_csv("c4_epa_air_quality.csv", index_col=0)

## **Step 2: Data exploration** 

To understand how the dataset is structured, I will display the first 10 rows of the data.

In [3]:
# Display first 10 rows of the data.
epa_data.head(10)

Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
0,2018-01-01,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7
1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
2,2018-01-01,Wyoming,Teton,Not in a city,Yellowstone National Park - Old Faithful Snow ...,Carbon monoxide,Parts per million,0.111111,2
3,2018-01-01,Pennsylvania,Philadelphia,Philadelphia,North East Waste (NEW),Carbon monoxide,Parts per million,0.3,3
4,2018-01-01,Iowa,Polk,Des Moines,CARPENTER,Carbon monoxide,Parts per million,0.215789,3
5,2018-01-01,Hawaii,Honolulu,Not in a city,Kapolei,Carbon monoxide,Parts per million,0.994737,14
6,2018-01-01,Hawaii,Honolulu,Not in a city,Kapolei,Carbon monoxide,Parts per million,0.2,2
7,2018-01-01,Pennsylvania,Erie,Erie,,Carbon monoxide,Parts per million,0.2,2
8,2018-01-01,Hawaii,Honolulu,Honolulu,Honolulu,Carbon monoxide,Parts per million,0.4,5
9,2018-01-01,Colorado,Larimer,Fort Collins,Fort Collins - CSU - S. Mason,Carbon monoxide,Parts per million,0.3,6


This dataset can be sorted and grouped in many ways. For this exercise, I will focus on the `state_name` column.

In [4]:
# Get descriptive stats.
epa_data.describe()

Unnamed: 0,arithmetic_mean,aqi
count,260.0,260.0
mean,0.403169,6.757692
std,0.317902,7.061707
min,0.0,0.0
25%,0.2,2.0
50%,0.276315,5.0
75%,0.516009,9.0
max,1.921053,50.0


A few points stand out from these statistics:

- There are 260 aqi measurements represented in this dataset.
- 25% of the aqi values in the data are at or below 2.
- 75% of the aqi values in the data are at or below 9.
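
The quartiles reported by `describe()` can also be computed directly with `quantile()`. The following is a minimal sketch, assuming a small hypothetical sample made of the ten `aqi` values from the preview above rather than the full 260-row dataset, so its numbers differ from the table.

```python
import pandas as pd

# Hypothetical sample: the ten aqi values from the head(10) preview,
# not the full dataset.
sample_aqi = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")

# quantile() interpolates linearly between data points by default.
quartiles = sample_aqi.quantile([0.25, 0.50, 0.75])
print(quartiles)

# Share of observations at or below the first quartile.
share_at_or_below_q1 = (sample_aqi <= quartiles.loc[0.25]).mean()
print(share_at_or_below_q1)
```

Because of ties and the small sample size, the share of values at or below a quartile need not be exactly 25% or 75%, which is why the bullets above say "at or below".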

## **Step 3: Statistical tests** 

Next, I will get some descriptive statistics about the states in the data.

In [12]:
# Get descriptive stats about the states in the data.
epa_data['state_name'].describe()

count            260
unique            52
top       California
freq              66
Name: state_name, dtype: object

The `describe()` output shows 260 non-null rows and 52 unique values. At first glance, this is a red flag: there are only 50 states in the United States, so some names may be misspelled, or some entries may not be states at all. The most frequent value is California, with 66 appearances.
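
To make the four fields of this summary concrete, here is a short sketch that rebuilds them by hand. It assumes a hypothetical sample made of the ten `state_name` values from the preview in Step 2, so the numbers differ from the full dataset.

```python
import pandas as pd

# Hypothetical sample: the state_name values from the head(10) preview.
states = pd.Series(
    ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
     "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    name="state_name",
)

counts = states.value_counts()
summary = {
    "count": int(states.count()),     # non-null entries
    "unique": int(states.nunique()),  # distinct values
    "top": counts.idxmax(),           # most frequent value
    "freq": int(counts.max()),        # number of times it appears
}
print(summary)
```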

To investigate, I will count how often each value appears and check whether any state is misspelled.

In [15]:
# Count occurrences of each value in the state_name column (most frequent first).
epa_data['state_name'].value_counts()

state_name
California              66
Arizona                 14
Ohio                    12
Florida                 12
Texas                   10
New York                10
Pennsylvania            10
Michigan                 9
Colorado                 9
Minnesota                7
New Jersey               6
Indiana                  5
North Carolina           4
Massachusetts            4
Maryland                 4
Oklahoma                 4
Virginia                 4
Nevada                   4
Connecticut              4
Kentucky                 3
Missouri                 3
Wyoming                  3
Iowa                     3
Hawaii                   3
Utah                     3
Vermont                  3
Illinois                 3
New Hampshire            2
District Of Columbia     2
New Mexico               2
Montana                  2
Oregon                   2
Alaska                   2
Georgia                  2
Washington               2
Idaho                    2
Nebraska         

As it turns out, there are no misspelled or repeated states. The two extra values are Puerto Rico and the District of Columbia. Puerto Rico is a U.S. territory, and the District of Columbia is a federal district, not a state. Even so, there is no reason to remove these two entries from the data. Their information is still valuable for the purpose of this exercise.
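
Rather than reading the list by eye, the same check can be done by comparing the values against a reference set of the 50 state names. This is only a sketch: `US_STATES` and the short `names` sample are illustrative names introduced here, not part of the original notebook.

```python
import pandas as pd

# Reference set of the 50 U.S. states (illustrative helper).
US_STATES = {
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
}

# Hypothetical sample of state_name values, including the two non-state entries.
names = pd.Series(
    ["California", "Puerto Rico", "District Of Columbia", "Ohio", "Nebraska"],
    name="state_name",
)

# Values that are not exact matches for a state name.
not_states = sorted(set(names) - US_STATES)
print(not_states)
```

On the real data, the same expression with `epa_data['state_name']` in place of `names` would list every value that does not exactly match a state name, including any misspellings.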

## **Step 4: Results and evaluation**

In [13]:
# Mean value from the aqi column.
epa_data['aqi'].mean()

6.757692307692308

In [16]:
# Median value from the aqi column.
np.median(epa_data["aqi"])

5.0

In [17]:
# Minimum value from the aqi column.
np.min(epa_data["aqi"])

0

In [18]:
# Maximum value from the aqi column.
np.max(epa_data["aqi"])

50

In [21]:
# Sample standard deviation (ddof=1) for the aqi column.
np.std(epa_data["aqi"], ddof=1)

7.061706678820724

The standard deviation for the aqi column is about 7.06. This value gives us an idea of how spread out the values in this column are. We also know that the mean is about 6.76 and the median is 5.0. The mean being higher than the median suggests that a few high values pull the average up. Together, these statistics give us a fuller picture of the column.
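
The `ddof=1` argument matters because `np.std` computes the population standard deviation (`ddof=0`) by default, while `ddof=1` gives the sample standard deviation that `describe()` reports. A minimal sketch of the difference, assuming a hypothetical sample of the ten `aqi` values from the preview in Step 2:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the ten aqi values from the head(10) preview.
sample_aqi = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")
values = sample_aqi.to_numpy()

population_std = np.std(values)       # ddof=0: divides by n
sample_std = np.std(values, ddof=1)   # ddof=1: divides by n - 1
pandas_std = sample_aqi.std()         # pandas uses ddof=1 by default

print(population_std, sample_std, pandas_std)
```

The sample version is always slightly larger than the population version, and the gap shrinks as the number of observations grows.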

## **Conclusion**

75% of the aqi values are at or below 9, which sits well inside the 0 to 50 range that the AQI guide labels as good. This means at least 75% of the data comes from places with good air quality. Knowing this, the EPA can focus on the places where the values are over 9.
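
As a sketch of how that follow-up could start, the snippet below computes the share of readings at or below the threshold and lists the rows above it. It assumes a hypothetical sample built from the preview rows in Step 2, and `THRESHOLD` is an illustrative name for the 75th percentile taken from the `describe()` output.

```python
import pandas as pd

# Hypothetical sample: state_name and aqi from the head(10) preview.
sample = pd.DataFrame({
    "state_name": ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
                   "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    "aqi": [7, 5, 2, 3, 3, 14, 2, 2, 5, 6],
})

THRESHOLD = 9  # 75th percentile of aqi in the full dataset

# Fraction of readings at or below the threshold.
share_at_or_below = (sample["aqi"] <= THRESHOLD).mean()

# Readings above the threshold, highest first.
above = sample[sample["aqi"] > THRESHOLD].sort_values("aqi", ascending=False)

print(share_at_or_below)
print(above)
```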

In this notebook, I focused on computing the first key statistics to understand the state of the data and to familiarize myself with the dataset in general. In the next notebook, I will apply probability distributions to this dataset.

**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[numpy.std - NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html).

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).