# Activity: Explore descriptive statistics

## **Introduction**

Data professionals often use descriptive statistics to understand the data they are working with and to provide collaborators with a summary of the relative location of values in the data, as well as information about its spread.

For this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). You are assigned to analyze data on air quality with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. You will use Python functions to gather statistics about air quality, then share insights with stakeholders.

## **Step 1: Imports** 


Import the relevant Python libraries `pandas` and `numpy`.

In [1]:
# Import relevant Python libraries.
import pandas as pd
import numpy as np

The dataset provided is in the form of a .csv file named `c4_epa_air_quality.csv`. It contains a subset of data from the U.S. EPA. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE
epa_data = pd.read_csv("c4_epa_air_quality.csv", index_col = 0)

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about loading data in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `pandas` library that allows you to read in data from a .csv file and load it into a DataFrame. 

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `read_csv` function from the `pandas` library. The `index_col` parameter can be set to `0` to read in the first column as an index (and to avoid `"Unnamed: 0"` appearing as a column in the resulting DataFrame).

</details>
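
As a minimal sketch of what `index_col` changes, the example below reads a small in-memory CSV twice. The CSV text is a hypothetical miniature that only mimics the layout of a file saved with its row index; it is not the lab's dataset.

```python
import io

import pandas as pd

# Hypothetical miniature CSV that mimics a file saved with its row index:
# the first header field is empty, followed by two named columns.
csv_text = ",state_name,aqi\n0,Arizona,7\n1,Ohio,5\n2,Wyoming,2\n"

# Default read: pandas labels the unnamed first column "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the row index instead of a data column.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

With `index_col=0`, only the two named columns remain as data columns, which is the behavior the hint describes.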

## **Step 2: Data exploration** 

To understand how the dataset is structured, display the first 10 rows of the data.

In [3]:
# Display first 10 rows of the data.
epa_data.head(10)

| | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |
| 5 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.994737 | 14 |
| 6 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.2 | 2 |
| 7 | 2018-01-01 | Pennsylvania | Erie | Erie | | Carbon monoxide | Parts per million | 0.2 | 2 |
| 8 | 2018-01-01 | Hawaii | Honolulu | Honolulu | Honolulu | Carbon monoxide | Parts per million | 0.4 | 5 |
| 9 | 2018-01-01 | Colorado | Larimer | Fort Collins | Fort Collins - CSU - S. Mason | Carbon monoxide | Parts per million | 0.3 | 6 |


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about exploratory data analysis in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `pandas` library that allows you to get a specific number of rows from the top of a DataFrame. 

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `head()` function from the `pandas` library.

</details>

**Question:** What does the `aqi` column represent?

The `aqi` column represents the Air Quality Index, a standardized value used to report daily air quality. It indicates how clean or polluted the air is and what associated health effects might be of concern. A higher AQI value typically signifies worse air quality and greater potential health risks.

Now, get a table that contains some descriptive statistics about the data.

In [4]:
# Get descriptive stats.
epa_data.describe()

| | arithmetic_mean | aqi |
|---|---|---|
| count | 260.0 | 260.0 |
| mean | 0.403169 | 6.757692 |
| std | 0.317902 | 7.061707 |
| min | 0.0 | 0.0 |
| 25% | 0.2 | 2.0 |
| 50% | 0.276315 | 5.0 |
| 75% | 0.516009 | 9.0 |
| max | 1.921053 | 50.0 |


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate a table of basic descriptive statistics about the numeric columns in a DataFrame.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `describe()` function from the `pandas` library.

</details>
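
The sketch below illustrates that `describe()` called on a DataFrame summarizes only the numeric columns by default. The toy DataFrame is an assumption for illustration, built from the first five rows of the `head()` output above; it is not the full dataset.

```python
import pandas as pd

# Toy DataFrame built from the first five rows of the head() output above.
toy = pd.DataFrame({
    "state_name": ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa"],
    "arithmetic_mean": [0.473684, 0.263158, 0.111111, 0.3, 0.215789],
    "aqi": [7, 5, 2, 3, 3],
})

# By default, describe() skips the text column and keeps the numeric ones.
summary = toy.describe()
print(summary)
```

The text column `state_name` does not appear in the result, which is why the table in this step lists only `arithmetic_mean` and `aqi`.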

**Question:** Based on the table of descriptive statistics, what do you notice about the count value for the `aqi` column?

The count for the `aqi` column is 260, the same as the count for `arithmetic_mean`. The count reports how many non-null (non-missing) entries exist in a column, so a count lower than the total number of rows in the dataset would indicate missing values. Checking the count is a quick way to assess data completeness.
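
To make the link between the count and missing values concrete, here is a small sketch on a hypothetical five-row column in which one value is deliberately missing. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value (np.nan).
toy = pd.DataFrame({"aqi": [7, 5, np.nan, 3, 14]})

n_rows = len(toy)                          # total number of rows
n_non_null = int(toy["aqi"].count())       # what describe() reports as "count"
n_missing = int(toy["aqi"].isna().sum())   # number of missing entries

print(n_rows, n_non_null, n_missing)
```

When the count equals the number of rows, as it would if `n_missing` were 0, the column has no missing values.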

**Question:** What do you notice about the 25th percentile for the `aqi` column?

This is an important measure for understanding where the `aqi` values lie.

The 25th percentile (also known as Q1) of the `aqi` column is 2.0, so about 25% of the AQI values are at or below 2. Because lower AQI values mean cleaner air, this shows that a quarter of the observations reflect very clean air with respect to carbon monoxide.

**Question:** What do you notice about the 75th percentile for the `aqi` column?

This is another important measure for understanding where the `aqi` values lie.

The 75th percentile (Q3) of the `aqi` column is 9.0, so about 75% of the AQI values are at or below 9. Together with Q1, it bounds the middle half of the data (from 2 to 9). Values far above Q3, such as the maximum of 50, stand out as unusually high for this dataset, which makes Q3 useful for detecting outliers.
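
As a rough sketch of how the quartiles can be used to flag unusually high values, the example below takes the ten `aqi` values shown in the `head()` output above and applies the common 1.5 × IQR rule. That rule is a general convention and an assumption here; the lab itself does not prescribe it.

```python
import numpy as np

# The ten aqi values from the head() output above (not the full dataset).
aqi_sample = np.array([7, 5, 2, 3, 3, 14, 2, 2, 5, 6])

# Quartiles, using NumPy's default linear interpolation.
q1, q2, q3 = np.percentile(aqi_sample, [25, 50, 75])

# Interquartile range and the conventional upper fence (assumed rule).
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are candidates for a closer look.
high_values = aqi_sample[aqi_sample > upper_fence]

print(q1, q2, q3, upper_fence, high_values)
```

On these ten rows, only the reading of 14 falls above the fence. The same approach could be applied to the full `epa_data["aqi"]` column.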

## **Step 3: Statistical tests** 

Next, get some descriptive statistics about the states in the data.

In [5]:
# Get descriptive stats about the states in the data.
epa_data["state_name"].describe()

count            260
unique            52
top       California
freq              66
Name: state_name, dtype: object

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate basic descriptive statistics about a DataFrame or a column you are interested in.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

 Use the `describe()` function from the `pandas` library. Note that this function can be used:
- "on a DataFrame (to find descriptive statistics about the numeric columns)" 
- "directly on a column containing categorical data (to find pertinent descriptive statistics)"

</details>
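
The following sketch shows what `describe()` returns for a column of text values, using the ten state names from the `head()` output above as a stand-in for the full `state_name` column.

```python
import pandas as pd

# The ten state_name values from the head() output above (not the full dataset).
states = pd.Series(
    ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
     "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    name="state_name",
)

# For text data, describe() reports count, unique, top, and freq.
summary = states.describe()
print(summary)

# value_counts() gives the full frequency table behind "top" and "freq".
counts = states.value_counts()
print(counts)
```

Here `top` is the most frequent value and `freq` is how many times it occurs, which matches the first row of `value_counts()`.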

**Question:** What do you notice while reviewing the descriptive statistics about the states in the data? 

Note: Sometimes you have to calculate statistics individually. To review that approach, use the `numpy` library to calculate each of the main statistics in the preceding table for the `aqi` column.

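There are 260 non-null `state_name` values and 52 unique names. Since there are only 50 U.S. states, at least two of those names are not states (for example, districts or territories). The most frequent value is California, which appears 66 times (about 25% of the rows), so California is heavily represented in this subset compared with the other states.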

## **Step 4: Results and evaluation**

Now, compute the mean value from the `aqi` column.

In [6]:
import numpy as np

# Compute the mean value from the aqi column
mean_aqi = np.mean(epa_data["aqi"])
print("Mean AQI:", mean_aqi)

Mean AQI: 6.757692307692308


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the mean value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `mean()` function from the `numpy` library.

</details>

**Question:** What do you notice about the mean value from the `aqi` column?

This is an important measure, as it tells you what the average air quality is based on the data.

The mean AQI is about 6.76, which estimates the average air quality across all recorded observations. A higher mean could indicate more widespread pollution events or consistently poor air quality. A mean close to the median (the 50th percentile) suggests a relatively symmetric distribution. Here, the mean is noticeably higher than the median of 5.0 from the table of descriptive statistics, which suggests skewed data with some high-AQI values.

Next, compute the median value from the `aqi` column.

In [7]:
import numpy as np

# Compute the median value from the aqi column
median_aqi = np.median(epa_data["aqi"])
print("Median AQI:", median_aqi)

Median AQI: 5.0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the median value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `median()` function from the `numpy` library.

</details>

**Question:** What do you notice about the median value from the `aqi` column?

This is an important measure for understanding the central location of the data.

The median AQI is 5.0. The median is the middle value of the data and is less affected by outliers than the mean. Here, the median is lower than the mean of about 6.76, which indicates a right-skewed distribution: a small number of higher AQI readings pull the mean upward without affecting the majority of the data.
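
The sketch below illustrates why the median is less sensitive to extreme values than the mean. It uses the ten `aqi` values from the `head()` output above, then swaps the largest reading for 50 (the dataset's maximum) as a hypothetical "what if" scenario.

```python
import numpy as np

# The ten aqi values from the head() output above (not the full dataset).
aqi_sample = np.array([7, 5, 2, 3, 3, 14, 2, 2, 5, 6])

mean_sample = np.mean(aqi_sample)
median_sample = np.median(aqi_sample)

# Hypothetical scenario: replace the largest reading (14) with 50.
aqi_with_extreme = np.where(aqi_sample == 14, 50, aqi_sample)

mean_extreme = np.mean(aqi_with_extreme)
median_extreme = np.median(aqi_with_extreme)

print("Mean:", mean_sample, "->", mean_extreme)
print("Median:", median_sample, "->", median_extreme)
```

The single extreme value raises the mean substantially, while the median does not move. A mean that sits above the median is therefore a sign of right skew.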

Next, identify the minimum value from the `aqi` column.

In [8]:
import numpy as np

# Identify the minimum value from the aqi column
min_aqi = np.min(epa_data["aqi"])
print("Minimum AQI:", min_aqi)

Minimum AQI: 0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the minimum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `min()` function from the `numpy` library.

</details>

**Question:** What do you notice about the minimum value from the `aqi` column?

This is an important measure, as it tells you the best air quality observed in the data.

The minimum AQI value is 0, the best air quality recorded in the dataset. An AQI between 0 and 50 indicates good air quality with little or no risk to health, so at least one observation in the dataset reflects excellent air quality with respect to carbon monoxide.

Now, identify the maximum value from the `aqi` column.

In [9]:
import numpy as np

# Identify the maximum value from the aqi column
max_aqi = np.max(epa_data["aqi"])
print("Maximum AQI:", max_aqi)

Maximum AQI: 50


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the maximum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `max()` function from the `numpy` library.

</details>

**Question:** What do you notice about the maximum value from the `aqi` column?

This is an important measure, as it tells you the worst air quality observed in the data.

The maximum AQI value is 50, the worst air quality recorded in the dataset. Even this value is well below 100, the level at or below which AirNow.gov considers air quality generally satisfactory. Although the maximum is far above the 75th percentile of 9, none of the observations in this dataset indicates unhealthy carbon monoxide levels. The maximum is still useful for identifying the sites or time periods with the highest readings, which may merit further investigation.
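
Beyond the extreme values themselves, it can help to see which rows they come from. The sketch below uses `idxmin()` and `idxmax()` on a toy DataFrame built from the `state_name` and `aqi` columns of the `head()` output above; using these two methods is a suggestion, not part of the lab's instructions.

```python
import pandas as pd

# state_name and aqi from the ten rows of the head() output above.
toy = pd.DataFrame({
    "state_name": ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
                   "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    "aqi": [7, 5, 2, 3, 3, 14, 2, 2, 5, 6],
})

# idxmax()/idxmin() return the index label of the first extreme value.
worst_row = toy.loc[toy["aqi"].idxmax()]
best_row = toy.loc[toy["aqi"].idxmin()]

print("Highest AQI:", worst_row["state_name"], worst_row["aqi"])
print("Lowest AQI:", best_row["state_name"], best_row["aqi"])
```

When several rows share the same extreme value, only the first one is returned, so a filter such as `toy[toy["aqi"] == toy["aqi"].min()]` may be preferable for listing all of them.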

Now, compute the standard deviation for the `aqi` column.

By default, the `numpy` library uses 0 as the Delta Degrees of Freedom (`ddof`), while the `pandas` library uses 1. To get the same value for standard deviation from either library, set the `ddof` parameter to 1 when calculating standard deviation with `numpy`.

In [10]:
# Compute the standard deviation for the aqi column
std_aqi = np.std(epa_data["aqi"], ddof=1)  # ddof=1 for sample standard deviation
print("Standard Deviation of AQI:", std_aqi)

Standard Deviation of AQI: 7.0617066788207215


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video section about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the standard deviation from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

Use the `std()` function from the `numpy` library. Make sure to specify the `ddof` parameter as 1. To read more about this function, refer to its documentation in the references section of this lab.

</details>
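
To show where the difference between the two defaults comes from, the sketch below computes the standard deviation by hand with both divisors (`n` and `n - 1`) and compares the results with `numpy` and `pandas`. It uses the ten `aqi` values from the `head()` output above, not the full dataset.

```python
import numpy as np
import pandas as pd

# The ten aqi values from the head() output above (not the full dataset).
aqi_sample = np.array([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], dtype=float)
n = aqi_sample.size

# Sum of squared deviations from the mean.
squared_deviations = (aqi_sample - aqi_sample.mean()) ** 2

# ddof=0 divides by n (population); ddof=1 divides by n - 1 (sample).
manual_population_std = np.sqrt(squared_deviations.sum() / n)
manual_sample_std = np.sqrt(squared_deviations.sum() / (n - 1))

numpy_default_std = np.std(aqi_sample)          # ddof=0 by default
numpy_sample_std = np.std(aqi_sample, ddof=1)   # matches pandas
pandas_default_std = pd.Series(aqi_sample).std()  # ddof=1 by default

print(numpy_default_std, numpy_sample_std, pandas_default_std)
```

Dividing by `n - 1` always gives a slightly larger result than dividing by `n`, and the gap shrinks as the number of observations grows.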

**Question:** What do you notice about the standard deviation for the `aqi` column? 

This is an important measure of how spread out the `aqi` values are.

The standard deviation for the `aqi` column is about 7.06. It represents the amount of variation or spread in the AQI data. A larger standard deviation indicates that the air quality measurements are widely spread out from the mean, whereas a smaller standard deviation indicates that the values are closer to the mean.

Here, the standard deviation is slightly larger than the mean of about 6.76, so the spread is large relative to the average. This suggests that the data mixes many observations with low AQI values and some with considerably higher ones. Understanding this can help identify regions with more variable air quality.

## **Considerations**


**What are some key takeaways that you learned during this lab?**

During this lab, I learned how to calculate and interpret key descriptive statistics for a dataset, including measures of central tendency (mean, median), spread (standard deviation), and boundaries (minimum and maximum values). These measures are critical for understanding the data and identifying trends or outliers. The analysis also helped me understand how the AQI (Air Quality Index) values vary across the dataset, which provides valuable insight into where carbon monoxide readings are highest.

**How would you present your findings from this lab to others? Consider the following relevant points noted by AirNow.gov as you respond:**
- "AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy—at first for certain sensitive groups of people, then for everyone as AQI values increase."
- "An AQI of 100 for carbon monoxide corresponds to a level of 9.4 parts per million."

Presentation of findings:
To present the findings from this lab, I would emphasize the following key insights:

- **Average air quality (mean AQI):** The mean AQI of about 6.76 (median 5) gives an overall picture of the average air quality in the dataset. A mean close to or above 100 would indicate unsatisfactory air quality in many areas. The observed mean is far below that level.
- **Spread of AQI values:** The standard deviation of about 7.06 shows how much air quality deviates from the average. It is slightly larger than the mean, so some observations have noticeably higher AQI values than others.
- **Extreme AQI values:** The minimum (0) and maximum (50) AQI values pinpoint the best and worst air quality recorded. This is particularly important when considering vulnerable populations, as extremely high AQI values would indicate unhealthy air conditions. None appear in this dataset.
- **Interpretation of AQI levels:** Based on AirNow.gov's guidelines, AQI values at or below 100 are generally considered satisfactory, while values above 100 are unhealthy, at first for sensitive groups and then for the general population as AQI increases.
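
One possible way to back these points with numbers is to compare the data against the two AirNow.gov reference values quoted above. The sketch below does this for the ten rows of the `head()` output; the constant names are made up for illustration, and the same comparisons could be run on the full `epa_data` columns.

```python
import numpy as np

# Reference values quoted from AirNow.gov in this section.
SATISFACTORY_AQI_LIMIT = 100   # AQI at or below this is generally satisfactory
CO_PPM_AT_AQI_100 = 9.4        # carbon monoxide level (ppm) at an AQI of 100

# aqi and arithmetic_mean (ppm) from the ten rows of the head() output above.
aqi_sample = np.array([7, 5, 2, 3, 3, 14, 2, 2, 5, 6])
co_ppm_sample = np.array([0.473684, 0.263158, 0.111111, 0.3, 0.215789,
                          0.994737, 0.2, 0.2, 0.4, 0.3])

# Share of observations at or below the satisfactory limit.
share_satisfactory = np.mean(aqi_sample <= SATISFACTORY_AQI_LIMIT)

# Highest concentration as a fraction of the level that corresponds to AQI 100.
max_ppm_ratio = co_ppm_sample.max() / CO_PPM_AT_AQI_100

print(f"Share satisfactory: {share_satisfactory:.0%}")
print(f"Highest reading relative to 9.4 ppm: {max_ppm_ratio:.1%}")
```

A single percentage such as the share of satisfactory readings tends to be easier for stakeholders to absorb than a table of statistics.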

**What summary would you provide to stakeholders? Use the same information provided previously from AirNow.gov as you respond.**

Based on the results of this analysis:

- All 260 observations have an AQI at or below 50, well under the level of 100 above which AirNow.gov considers air quality unhealthy.
- The mean AQI is about 6.76 and the median is 5, so typical carbon monoxide levels are very low. The standard deviation of about 7.06 shows that readings vary from site to site, but within a low range.
- The highest average concentration recorded is about 1.92 parts per million, far below the 9.4 parts per million that corresponds to an AQI of 100. No observation in this dataset indicates a health risk from carbon monoxide, even for sensitive groups such as children, elderly individuals, or those with respiratory conditions.
- I would recommend continued monitoring, with particular attention to the sites with the highest readings, and efforts to reduce carbon monoxide emissions in any area where AQI values approach or exceed 100.

**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[numpy.std — NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html).

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).