# Activity: Explore descriptive statistics

## **Introduction**

Data professionals often use descriptive statistics to understand the data they are working with and to provide collaborators with a summary of the relative location of values in the data, as well as information about its spread.

For this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). You are assigned to analyze data on air quality with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. You will use Python functions to gather statistics about air quality, then share insights with stakeholders.

## **Step 1: Imports**


Import the relevant Python libraries `pandas` and `numpy`.

In [None]:
# Import relevant Python libraries.

### YOUR CODE HERE ###
import pandas as pd
import numpy as np



The dataset provided is in the form of a .csv file named `c4_epa_air_quality.csv`. It contains a subset of data from the U.S. EPA. As shown in this cell, the dataset has been automatically loaded for you. You do not need to download the .csv file or provide more code to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE
data = pd.read_csv("c4_epa_air_quality.csv", index_col=0)

<details>
  <summary><h4><strong>Hint</strong></h4></summary>

  Use the `read_csv` function from the `pandas` library. The `index_col` parameter can be set to `0` to read in the first column as an index (and to avoid `"Unnamed: 0"` appearing as a column in the resulting DataFrame).

</details>
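
To illustrate the effect of `index_col`, here is a minimal sketch that reads the same CSV text with and without the parameter. The CSV contents are hypothetical stand-ins, not rows from `c4_epa_air_quality.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text (not the real dataset). The empty first header
# mimics an index column that was saved along with the data.
csv_text = ",state_name,aqi\n0,Arizona,18\n1,Ohio,9\n2,Texas,5\n"

# Without index_col, pandas keeps the unnamed column as regular data.
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(without_index.columns))
print(list(with_index.columns))
```

The first list includes `"Unnamed: 0"`, while the second contains only the named data columns.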

## **Step 2: Data exploration**

To understand how the dataset is structured, display the first 10 rows of the data.

In [None]:
# Display first 10 rows of the data.

### YOUR CODE HERE
print(data.head(10))

<details>
  <summary><h4><strong>Hint</strong></h4></summary>

  Use the `head()` function from the `pandas` library.

</details>

**Question:** What does the `aqi` column represent?

[The `aqi` column represents the Air Quality Index (AQI), which generally ranges from 0 to 500. Lower values indicate better air quality and higher values indicate worsening conditions: values from 0 to 50 indicate good air quality, values at or below 100 are generally considered acceptable, and values above 100 may pose health risks, particularly for sensitive groups.]
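
The ranges described above can be sketched as a small lookup function. The function name `aqi_category` is hypothetical and not part of the lab; the category boundaries follow the standard AQI bands published by AirNow.

```python
def aqi_category(aqi):
    """Return the AQI category label for a nonnegative AQI value.

    Hypothetical helper for illustration; not part of the lab.
    """
    # Upper bound of each band, paired with its category label.
    bands = [
        (50, "Good"),
        (100, "Moderate"),
        (150, "Unhealthy for Sensitive Groups"),
        (200, "Unhealthy"),
        (300, "Very Unhealthy"),
    ]
    for upper, label in bands:
        if aqi <= upper:
            return label
    # Anything above 300 falls in the top band.
    return "Hazardous"


for value in (5, 50, 51, 120):
    print(value, aqi_category(value))
```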

Now, get a table that contains some descriptive statistics about the data.

In [None]:
# Get descriptive stats.

### YOUR CODE HERE
print(data.describe())


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate a table of basic descriptive statistics about the numeric columns in a DataFrame.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `describe()` function from the `pandas` library.

</details>

**Question:** Based on the table of descriptive statistics, what do you notice about the count value for the `aqi` column?

[The descriptive statistics show that the count for the aqi column is 260, which means that all 260 rows in the dataset have a recorded AQI value. This indicates that there are no missing values in the aqi column.]
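
One way to confirm this reading of `count` is to compare it with the number of rows and the number of missing values. The sketch below uses a hypothetical toy column with one missing value, since `count` only tallies non-missing entries.

```python
import numpy as np
import pandas as pd

# Toy data (not the real dataset): five rows, one missing AQI value.
toy = pd.DataFrame({"aqi": [3, 7, np.nan, 12, 5]})

n_rows = len(toy)                    # total number of rows
n_non_missing = toy["aqi"].count()   # count ignores NaN
n_missing = toy["aqi"].isna().sum()  # number of NaN entries

print(n_rows, n_non_missing, n_missing)
```

When `count` equals the number of rows, as in the lab's data, the column has no missing values.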

**Question:** What do you notice about the 25th percentile for the `aqi` column?

This is an important measure for understanding where the aqi values lie.

From the descriptive statistics table, the 25th percentile (also known as the first quartile, Q1) of the `aqi` column is 2. This means that 25% of the AQI values are at or below 2. This is a very low value, indicating that a substantial portion of the data reflects excellent air quality.

**Question:** What do you notice about the 75th percentile for the `aqi` column?

This is another important measure for understanding where the aqi values lie.

From the descriptive statistics table, the 75th percentile (also known as the third quartile, Q3) of the `aqi` column is 9. This means that 75% of the AQI values are at or below 9. In other words, at least three quarters of the air quality readings are well within the "good" range (AQI ≤ 50).
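
The quartiles reported by `describe()` can also be computed directly with `quantile()`. This is a minimal sketch on hypothetical toy values, not the lab's data; the interquartile range is included only as an illustration of how the two quartiles combine into a measure of spread.

```python
import pandas as pd

# Toy AQI values (not the real dataset), including one high reading.
aqi = pd.Series([0, 2, 4, 6, 8, 10, 12, 14, 50], name="aqi")

q1 = aqi.quantile(0.25)  # first quartile, the "25%" row in describe()
q3 = aqi.quantile(0.75)  # third quartile, the "75%" row in describe()
iqr = q3 - q1            # interquartile range: spread of the middle half

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)

# describe() reports the same quartiles.
summary = aqi.describe()
print(summary["25%"], summary["75%"])
```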

## **Step 3: Statistical tests**

Next, get some descriptive statistics about the states in the data.

In [None]:
# Get descriptive stats about the states in the data.

### YOUR CODE HERE
state_aqi_statistics = data.groupby('state_name').describe()
print(state_aqi_statistics)

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate basic descriptive statistics about a DataFrame or a column you are interested in.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `describe()` function from the `pandas` library. Note that this function can be used:
- on a DataFrame (to find descriptive statistics about the numeric columns)
- directly on a column containing categorical data (to find pertinent descriptive statistics)

</details>
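
As a sketch of the second use mentioned in the hint, calling `describe()` directly on a categorical column returns the count, the number of unique values, the most frequent value, and its frequency. The state names below are hypothetical toy values, not the lab's data.

```python
import pandas as pd

# Toy categorical column (not the real dataset).
states = pd.Series(
    ["California", "Arizona", "California", "Ohio", "California", "Arizona"],
    name="state_name",
)

# For non-numeric data, describe() reports count, unique, top, and freq.
summary = states.describe()
print(summary)
```

Here `top` is the state that appears most often and `freq` is how many times it appears.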

**Question:** What do you notice while reviewing the descriptive statistics about the states in the data?

Note: Sometimes you have to calculate statistics individually. To review that approach, use the `numpy` library to calculate each of the main statistics in the preceding table for the `aqi` column.

The data is the same.

## **Step 4: Results and evaluation**

Now, compute the mean value from the `aqi` column.

In [None]:
# Compute the mean value from the aqi column.

### YOUR CODE HERE
aqi_mean = data['aqi'].mean()
print("Mean AQI:", aqi_mean)


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the mean value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `mean()` function from the `numpy` library.

</details>
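
The hints refer to `numpy`, while a pandas Series also has its own `mean()` method. The sketch below, on hypothetical toy values, suggests that both routes give the same result as the sum divided by the number of values.

```python
import numpy as np
import pandas as pd

# Toy AQI values (not the real dataset).
aqi = pd.Series([2, 5, 9, 12], name="aqi")

mean_numpy = np.mean(aqi.to_numpy())  # numpy function on the raw values
mean_pandas = aqi.mean()              # pandas Series method
mean_manual = aqi.sum() / len(aqi)    # definition: sum divided by count

print(mean_numpy, mean_pandas, mean_manual)
```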

**Question:** What do you notice about the mean value from the `aqi` column?

This is an important measure, as it tells you what the average air quality is based on the data.

An AQI mean of 6.76 falls well within the "good" range (AQI ≤ 50). This suggests that, on average, the air quality in the monitored locations is generally safe and healthy for the general population.

Next, compute the median value from the `aqi` column.

In [None]:
# Compute the median value from the aqi column.

### YOUR CODE HERE
aqi_median = data['aqi'].median()
print("Median AQI:", aqi_median)



<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the median value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `median()` function from the `numpy` library.

</details>

**Question:** What do you notice about the median value from the `aqi` column?

This is an important measure for understanding the central location of the data.

A median of 5 indicates that half of the AQI values are at or below this number. This further reinforces the finding that air quality is generally good, as 5 is well below the threshold of 50, which marks the boundary between "good" and "moderate" air quality.
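
A median that sits below the mean, as it does here, is typical of data where a few high readings pull the average upward. The following sketch uses hypothetical toy values, not the lab's data, to show how a single large value affects the mean far more than the median.

```python
import numpy as np

# Toy AQI values (not the real dataset) with one unusually high reading.
aqi_values = np.array([1, 2, 3, 5, 6, 8, 45])

mean_value = np.mean(aqi_values)      # pulled upward by the value 45
median_value = np.median(aqi_values)  # middle value, unaffected by 45

print("Mean:", mean_value)
print("Median:", median_value)
```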

Next, identify the minimum value from the `aqi` column.

In [None]:
# Identify the minimum value from the aqi column.

### YOUR CODE HERE

aqi_min = data['aqi'].min()
print("Minimum AQI:", aqi_min)

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the minimum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `min()` function from the `numpy` library.

</details>

**Question:** What do you notice about the minimum value from the `aqi` column?

This is an important measure, as it tells you the best air quality observed in the data.

An AQI of 0 indicates excellent air quality. It means that at least one observation recorded a carbon monoxide concentration at or near zero on that day. This is the best possible AQI value, reflecting clean air conditions.


Now, identify the maximum value from the `aqi` column.

In [None]:
# Identify the maximum value from the aqi column.

### YOUR CODE HERE
aqi_max = data['aqi'].max()
print("Maximum AQI:", aqi_max)


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the maximum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `max()` function from the `numpy` library.

</details>
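
Beyond the extreme values themselves, it can help to see which rows they come from. This is a minimal sketch on a hypothetical toy table, not the lab's data, that uses `numpy` for the minimum and maximum and the pandas methods `idxmin()` and `idxmax()` to locate the corresponding rows.

```python
import numpy as np
import pandas as pd

# Toy table (not the real dataset).
toy = pd.DataFrame(
    {
        "state_name": ["Ohio", "Arizona", "California", "Texas"],
        "aqi": [4, 0, 27, 9],
    }
)

aqi_values = toy["aqi"].to_numpy()
lowest = np.min(aqi_values)   # best air quality observed
highest = np.max(aqi_values)  # worst air quality observed

# idxmin()/idxmax() return the index label of the extreme row.
best_state = toy.loc[toy["aqi"].idxmin(), "state_name"]
worst_state = toy.loc[toy["aqi"].idxmax(), "state_name"]

print("Minimum AQI:", lowest, "in", best_state)
print("Maximum AQI:", highest, "in", worst_state)
```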

**Question:** What do you notice about the maximum value from the `aqi` column?

This is an important measure, as it tells you which value in the data corresponds to the worst air quality observed in the data.

An AQI of 50 is the highest value recorded in this dataset, which corresponds to the worst air quality observed. It is still within the "good" range (AQI ≤ 50), so even the worst air quality measured during the observation period did not exceed that level.

Now, compute the standard deviation for the `aqi` column.

By default, the `numpy` library uses 0 as the Delta Degrees of Freedom, while the `pandas` library uses 1. To get the same value for standard deviation using either library, set the `ddof` parameter to 1 when calculating standard deviation.
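
The difference between the two defaults can be seen in a short sketch. The values below are hypothetical toy data chosen so that the population standard deviation is exactly 2; they are not from the lab's dataset.

```python
import numpy as np
import pandas as pd

# Toy values (not the real dataset): mean 5, squared deviations sum to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]
series = pd.Series(values)

std_numpy_default = np.std(values)       # ddof=0: divides by n
std_pandas_default = series.std()        # ddof=1: divides by n - 1
std_numpy_ddof1 = np.std(values, ddof=1) # matches the pandas default

print("numpy default (ddof=0): ", std_numpy_default)
print("pandas default (ddof=1):", std_pandas_default)
print("numpy with ddof=1:      ", std_numpy_ddof1)
```

With `ddof=1` the sum of squared deviations is divided by n - 1 instead of n, so the result is slightly larger than the `ddof=0` value.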

In [None]:
# Compute the standard deviation for the aqi column.

### YOUR CODE HERE
aqi_std = data['aqi'].std(ddof=1)
print("Standard Deviation of AQI:", aqi_std)



<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video section about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the standard deviation from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

Use the `std()` function from the `numpy` library. Make sure to specify the `ddof` parameter as 1. To read more about this function, refer to its documentation in the references section of this lab.

</details>

**Question:** What do you notice about the standard deviation for the `aqi` column?

This is an important measure of how spread out the aqi values are.

A standard deviation of 7.06 is slightly larger than the mean (6.76), so the AQI values are fairly spread out relative to their average. Together with a median of 5 and a maximum of 50, this suggests that most readings sit at low values while a smaller number of higher readings, reflecting differing air quality conditions across locations or times, widen the spread.

## **Considerations**


**What are some key takeaways that you learned during this lab?**

- Good air quality: The data shows generally good air quality, with an average AQI well below 100.
- Moderate variability: The standard deviation indicates some variability in AQI levels.
- Health implications: AQI values below 100 are satisfactory, but ongoing monitoring is crucial to avoid potential health risks.
- Descriptive statistics: Key metrics like mean, median, and standard deviation provide insights into air quality trends.
- Data analysis skills: The lab reinforced the importance of using Python and data analysis techniques to interpret real-world data.

**How would you present your findings from this lab to others? Consider the following relevant points noted by AirNow.gov as you respond:**
- "AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy—at first for certain sensitive groups of people, then for everyone as AQI values increase."
- "An AQI of 100 for carbon monoxide corresponds to a level of 9.4 parts per million."

General Air Quality: The average AQI is around 6.76, with the highest value at 50, both well below the "unhealthy" threshold of 100. This means air quality is generally satisfactory and safe.

Health Implications: AQI values above 100 are considered unhealthy, initially for sensitive groups, and increasingly for everyone as values rise. Our data indicates no such risk.
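
A claim like this can be backed by a direct check against the threshold. The sketch below counts readings above 100 in a hypothetical toy column; the constant name `UNHEALTHY_THRESHOLD` and the values are assumptions for illustration, not the lab's data.

```python
import pandas as pd

# Threshold from the AirNow.gov guidance quoted above.
UNHEALTHY_THRESHOLD = 100

# Toy AQI values (not the real dataset).
aqi = pd.Series([3, 12, 50, 7, 0], name="aqi")

# Boolean mask of readings above the threshold, then count and share.
n_above = int((aqi > UNHEALTHY_THRESHOLD).sum())
share_above = n_above / len(aqi)

print("Readings above 100:", n_above)
print("Share above 100:", share_above)
```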

**What summary would you provide to stakeholders? Use the same information provided previously from AirNow.gov as you respond.**

Our recent analysis of air quality data shows that the average AQI is 6.76, with a maximum recorded value of 50, both well below the threshold of 100. According to AirNow.gov, AQI values at or below 100 are generally considered satisfactory, indicating that the air quality in the monitored regions is generally safe.

However, it’s important to note that AQI values above 100 are considered unhealthy, first affecting sensitive groups and then everyone as values increase. For reference, an AQI of 100 corresponds to 9.4 parts per million of carbon monoxide, highlighting pollutant levels that could pose health risks.

Key Takeaway: Current air quality is satisfactory, but continuous monitoring is recommended to ensure public health and safety.

**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[Numpy.Std — NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).
