# Mini Project 5-1 Explore Descriptive Statistics with Python

## **Introduction**

Data professionals often use descriptive statistics to understand the data they are working with and to provide collaborators with a summary of the relative location of values in the data, as well as information about its spread.

For this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). You are assigned to analyze data on air quality with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. You will use Python functions to gather statistics about air quality, then share insights with stakeholders.

## **Step 1: Imports** 


Import the relevant Python libraries `pandas` and `numpy`.

In [4]:
# Import relevant Python libraries.

import pandas as pd
import numpy as np


The dataset provided is in the form of a .csv file named `c4_epa_air_quality.csv`. It contains a subset of data from the U.S. EPA. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this Project. Please continue with this activity by completing the following instructions.

In [5]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE
# Display the first five rows of the dataset
df = pd.read_csv("c4_epa_air_quality.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
0,0,2018-01-01,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7
1,1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
2,2,2018-01-01,Wyoming,Teton,Not in a city,Yellowstone National Park - Old Faithful Snow ...,Carbon monoxide,Parts per million,0.111111,2
3,3,2018-01-01,Pennsylvania,Philadelphia,Philadelphia,North East Waste (NEW),Carbon monoxide,Parts per million,0.3,3
4,4,2018-01-01,Iowa,Polk,Des Moines,CARPENTER,Carbon monoxide,Parts per million,0.215789,3


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `pandas` library that allows you to read in data from a .csv file and load it into a DataFrame. 

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `read_csv` function from the `pandas` library. The `index_col` parameter can be set to `0` to read in the first column as an index (and to avoid `"Unnamed: 0"` appearing as a column in the resulting DataFrame).

</details>
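
As a minimal sketch of what Hint 2 describes, the following reads a small in-memory CSV twice, once with the defaults and once with `index_col=0`. The CSV text and the names `csv_text`, `default_df`, and `indexed_df` are made up for illustration and are not the EPA file.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column is an unnamed saved index,
# similar in shape to a file written by DataFrame.to_csv().
csv_text = ",state_name,aqi\n0,Arizona,7\n1,Ohio,5\n2,Wyoming,2\n"

# Default read: the unnamed first column becomes a regular column.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.columns.tolist())

# With index_col=0, the first column is used as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_df.columns.tolist())
```

The cells in this notebook keep the default read, which is why `Unnamed: 0` appears in the outputs.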

## **Step 2: Data exploration** 

To understand how the dataset is structured, display the first 10 rows of the data.

In [6]:
# Display first 10 rows of the data.

### YOUR CODE HERE

df.head(10)

Unnamed: 0.1,Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
0,0,2018-01-01,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7
1,1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
2,2,2018-01-01,Wyoming,Teton,Not in a city,Yellowstone National Park - Old Faithful Snow ...,Carbon monoxide,Parts per million,0.111111,2
3,3,2018-01-01,Pennsylvania,Philadelphia,Philadelphia,North East Waste (NEW),Carbon monoxide,Parts per million,0.3,3
4,4,2018-01-01,Iowa,Polk,Des Moines,CARPENTER,Carbon monoxide,Parts per million,0.215789,3
5,5,2018-01-01,Hawaii,Honolulu,Not in a city,Kapolei,Carbon monoxide,Parts per million,0.994737,14
6,6,2018-01-01,Hawaii,Honolulu,Not in a city,Kapolei,Carbon monoxide,Parts per million,0.2,2
7,7,2018-01-01,Pennsylvania,Erie,Erie,,Carbon monoxide,Parts per million,0.2,2
8,8,2018-01-01,Hawaii,Honolulu,Honolulu,Honolulu,Carbon monoxide,Parts per million,0.4,5
9,9,2018-01-01,Colorado,Larimer,Fort Collins,Fort Collins - CSU - S. Mason,Carbon monoxide,Parts per million,0.3,6


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `pandas` library that allows you to get a specific number of rows from the top of a DataFrame. 

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `head()` function from the `pandas` library.

</details>

**Question:** What does the `aqi` column represent?

A: The `aqi` column represents the Air Quality Index.

**Question:** In what units are the aqi values expressed?

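A: The `units_of_measure` column lists parts per million, which describes the carbon monoxide concentration in the `arithmetic_mean` column. The AQI itself is a unitless index.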

Now, get a table that contains some descriptive statistics about the data.

In [8]:
# Get descriptive stats.

### YOUR CODE HERE

df.describe()

Unnamed: 0.1,Unnamed: 0,arithmetic_mean,aqi
count,260.0,260.0,260.0
mean,129.5,0.403169,6.757692
std,75.199734,0.317902,7.061707
min,0.0,0.0,0.0
25%,64.75,0.2,2.0
50%,129.5,0.276315,5.0
75%,194.25,0.516009,9.0
max,259.0,1.921053,50.0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate a table of basic descriptive statistics about the numeric columns in a DataFrame.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `describe()` function from the `pandas` library.

</details>

**Question:** Based on the table of descriptive statistics, what do you notice about the count value for the `aqi` column?

A: The count is 260, the same as for the other columns, so the `aqi` column has no missing values.

**Question:** What do you notice about the 25th percentile for the `aqi` column?

This is an important measure for understanding where the aqi values lie. 

A: The 25th percentile is 2, which is 3 below the median of 5. The 75th percentile is 4 above the median, so the lower quartile sits closer to the median, which suggests a slight rightward skew.

**Question:** What do you notice about the 75th percentile for the `aqi` column?

This is another important measure for understanding where the aqi values lie. 

A: The 75th percentile is 9, which is 4 above the median. That is more than the distance between the 25th percentile and the median, which is 3. Hence, there is a rightward skew.
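
The quartile comparison in the two answers above can be sketched in code. This is only an illustration: the values in `sample_aqi` are hypothetical, not the EPA data, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical AQI readings, chosen only to illustrate the calculation.
sample_aqi = pd.Series([1, 2, 2, 3, 5, 5, 6, 9, 12, 50])

# quantile() with a list returns the 25th, 50th, and 75th percentiles in order.
q1, median, q3 = sample_aqi.quantile([0.25, 0.50, 0.75])

# Distance from each quartile to the median.
lower_gap = median - q1
upper_gap = q3 - median

print("25th percentile:", q1, "median:", median, "75th percentile:", q3)
print("lower gap:", lower_gap, "upper gap:", upper_gap)

# A larger upper gap is one informal sign of a right-skewed distribution.
print("upper gap larger:", upper_gap > lower_gap)
```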

## **Step 3: Statistical tests** 

Next, get some descriptive statistics about the states in the data.

In [10]:
# Get descriptive stats about the states in the data.

### YOUR CODE HERE

df.groupby('state_name').describe()

,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,arithmetic_mean,arithmetic_mean,arithmetic_mean,arithmetic_mean,arithmetic_mean,aqi,aqi,aqi,aqi,aqi,aqi,aqi,aqi
state_name,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Alabama,1.0,71.0,,71.0,71.0,71.0,71.0,71.0,1.0,0.2,...,0.2,0.2,1.0,2.0,,2.0,2.0,2.0,2.0,2.0
Alaska,2.0,240.5,3.535534,238.0,239.25,240.5,241.75,243.0,2.0,0.555264,...,0.556579,0.557895,2.0,8.5,0.707107,8.0,8.25,8.5,8.75,9.0
Arizona,14.0,131.714286,88.793995,0.0,67.5,140.5,204.5,254.0,14.0,0.671804,...,0.93421,1.921053,14.0,15.214286,14.104983,2.0,5.5,10.0,19.75,50.0
Arkansas,1.0,186.0,,186.0,186.0,186.0,186.0,186.0,1.0,0.3,...,0.3,0.3,1.0,3.0,,3.0,3.0,3.0,3.0,3.0
California,66.0,137.272727,69.66702,16.0,76.25,144.0,198.75,250.0,66.0,0.684871,...,0.971491,1.742105,66.0,12.121212,7.301244,1.0,7.0,11.0,16.0,40.0
Colorado,9.0,119.333333,80.614825,9.0,56.0,97.0,179.0,251.0,9.0,0.330994,...,0.342105,0.478947,9.0,5.0,1.322876,3.0,5.0,5.0,6.0,7.0
Connecticut,4.0,119.25,94.757146,15.0,82.5,108.5,145.25,245.0,4.0,0.2,...,0.231944,0.261111,4.0,3.5,1.914854,1.0,2.5,4.0,5.0,5.0
Delaware,1.0,135.0,,135.0,135.0,135.0,135.0,135.0,1.0,0.215789,...,0.215789,0.215789,1.0,3.0,,3.0,3.0,3.0,3.0,3.0
District Of Columbia,2.0,219.0,50.911688,183.0,201.0,219.0,237.0,255.0,2.0,0.222222,...,0.233333,0.244444,2.0,2.5,0.707107,2.0,2.25,2.5,2.75,3.0
Florida,12.0,149.0,66.311935,39.0,109.0,153.0,204.25,232.0,12.0,0.299123,...,0.353947,0.6,12.0,5.5,2.430862,1.0,4.5,6.0,7.0,9.0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate basic descriptive statistics about a DataFrame or a column you are interested in.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `describe()` function from the `pandas` library. Note that this function can be used:
- on a DataFrame (to find descriptive statistics about the numeric columns)
- directly on a column containing categorical data (to find pertinent descriptive statistics)

</details>

**Question:** What do you notice while reviewing the descriptive statistics about the states in the data? 

Note: Sometimes you have to calculate statistics individually. To review that approach, use the `numpy` library to calculate each of the main statistics in the preceding table for the `aqi` column.

A: Some states have a missing `std` value because they have only one observation. The mean and median AQI also differ widely between states, which indicates that air quality is inconsistent across states.
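
The missing `std` values can be reproduced on a small example. The sketch below uses a hypothetical DataFrame named `toy` (not the EPA data) with one single-observation state, to show that `describe()` reports `NaN` for the standard deviation of a group of one.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one state with a single reading, one with three.
toy = pd.DataFrame({
    "state_name": ["Alabama", "Colorado", "Colorado", "Colorado"],
    "aqi": [2, 3, 5, 7],
})

# Per-state descriptive statistics for the aqi column only.
by_state = toy.groupby("state_name")["aqi"].describe()
print(by_state[["count", "mean", "std"]])

# The sample standard deviation (ddof=1) divides by n - 1, so it is
# undefined for a single observation and pandas reports NaN.
alabama_std = by_state.loc["Alabama", "std"]
colorado_std = by_state.loc["Colorado", "std"]
print("Alabama std is NaN:", np.isnan(alabama_std))
```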

## **Step 4: Results and evaluation**

Now, compute the mean value from the `aqi` column.

In [14]:
# Compute the mean value from the aqi column.

### YOUR CODE HERE

mean_aqi = df['aqi'].mean()

print("Mean AQI:", mean_aqi)

Mean AQI: 6.757692307692308


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the mean value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `mean()` function from the `numpy` library.

</details>

**Question:** What do you notice about the mean value from the `aqi` column?

This is an important measure, as it tells you what the average air quality is based on the data.

A: The mean AQI is about 6.76, which is low relative to the range of values in the data (0 to 50).

Next, compute the median value from the aqi column.

In [15]:
# Compute the median value from the aqi column.

### YOUR CODE HERE

median_aqi = df['aqi'].median()

print("Median AQI:", median_aqi)

Median AQI: 5.0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the median value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `median()` function from the `numpy` library.

</details>

**Question:** What do you notice about the median value from the `aqi` column?

This is an important measure for understanding the central location of the data.

A: The median (5.0) is lower than the mean (about 6.76), which implies that some high outlier values on the right are pulling the mean upward.
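
To illustrate why a few large values raise the mean above the median, here is a small sketch that uses the `numpy` functions named in the hints. The array `readings` is hypothetical and stands in for the `aqi` column.

```python
import numpy as np

# Hypothetical right-skewed readings: mostly small values plus one large one.
readings = np.array([2, 3, 3, 5, 5, 6, 9, 40])

mean_value = np.mean(readings)
median_value = np.median(readings)

print("mean:", mean_value)
print("median:", median_value)

# The single large value pulls the mean up but leaves the median unchanged.
print("mean above median:", mean_value > median_value)
```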

Next, identify the minimum value from the `aqi` column.

In [16]:
# Identify the minimum value from the aqi column.

### YOUR CODE HERE

min_aqi = df['aqi'].min()

print("Min AQI:", min_aqi)

Min AQI: 0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the minimum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `min()` function from the `numpy` library.

</details>

**Question:** What do you notice about the minimum value from the `aqi` column?

This is an important measure, as it tells you the best air quality observed in the data.

A: The minimum value for the `aqi` column is 0. This means that the best air quality observed in the data corresponds to an AQI of 0.


Now, identify the maximum value from the `aqi` column.

In [17]:
# Identify the maximum value from the aqi column.

### YOUR CODE HERE
max_aqi = df['aqi'].max()

print("Max AQI:", max_aqi)

Max AQI: 50


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the maximum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `max()` function from the `numpy` library.

</details>
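
The hints for the minimum and maximum refer to `numpy` functions, while the cells above call the equivalent `pandas` Series methods. A brief sketch, using a hypothetical Series named `aqi`, suggests that both routes give the same values:

```python
import numpy as np
import pandas as pd

# Hypothetical AQI readings, used only for illustration.
aqi = pd.Series([7, 5, 2, 3, 3, 14])

# numpy functions applied to the underlying array.
np_min = np.min(aqi.to_numpy())
np_max = np.max(aqi.to_numpy())

# Equivalent pandas Series methods.
pd_min = aqi.min()
pd_max = aqi.max()

print("min:", np_min, pd_min)
print("max:", np_max, pd_max)
print("range:", np_max - np_min)
```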

**Question:** What do you notice about the maximum value from the `aqi` column?

This is an important measure, as it tells you which value in the data corresponds to the worst air quality observed in the data.

A: The maximum is 50, which is very far from the mean of about 6.76. This suggests that some high outliers exist.

Now, compute the standard deviation for the `aqi` column.

By default, the `numpy` library uses 0 as the Delta Degrees of Freedom (`ddof`), while the `pandas` library uses 1. To get the same value for standard deviation using either library, set the `ddof` parameter to 1 when calculating standard deviation.

In [18]:
# Compute the standard deviation for the aqi column.

### YOUR CODE HERE

std_aqi = df['aqi'].std()

print("Std AQI:", std_aqi)

Std AQI: 7.061706678820724


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the standard deviation from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `std()` function from the `numpy` library. Make sure to specify the `ddof` parameter as 1. To read more about this function, refer to its documentation in the references section of this lab.

</details>
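
The effect of `ddof` can be checked on a small example. The sketch below uses a hypothetical Series named `values` (not the EPA data) and compares the `numpy` default, `numpy` with `ddof=1`, and the `pandas` default.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only to illustrate the ddof parameter.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# numpy default (ddof=0): divides the squared deviations by n.
population_std = np.std(values.to_numpy())

# numpy with ddof=1: divides by n - 1.
sample_std_np = np.std(values.to_numpy(), ddof=1)

# pandas default (ddof=1): matches the numpy call above.
sample_std_pd = values.std()

print("numpy, ddof=0:", population_std)
print("numpy, ddof=1:", sample_std_np)
print("pandas default:", sample_std_pd)
```

Dividing by `n - 1` instead of `n` always gives a slightly larger value, and the difference shrinks as the number of observations grows.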

**Question:** What do you notice about the standard deviation for the `aqi` column? 

This is an important measure of how spread out the aqi values are.

A: The standard deviation is about 7.06, which is larger than the mean of about 6.76, so the AQI values are fairly spread out.

## **Considerations**


**What are some key takeaways that you learned during this Project?**

A: Descriptive statistics for a dataset are fairly easy to compute with the built-in functions in `pandas` and `numpy`.

**How would you present your findings from this Project to others? Consider the following relevant points noted by AirNow.gov as you respond:**
- "AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy—at first for certain sensitive groups of people, then for everyone as AQI values increase."
- "An AQI of 100 for carbon monoxide corresponds to a level of 9 parts per million."

A: I would list the AQI values for the different states and then discuss the aggregate descriptive statistics for the entire dataset: the minimum, 25th percentile, median, 75th percentile, and maximum. I would also note that the underlying carbon monoxide concentrations are measured in parts per million.

**What summary would you provide to readers? Use the same information provided previously from AirNow.gov as you respond.**

A: The Air Quality Index (AQI) data shows a range of air quality levels across different states. The lowest recorded AQI value is 0, the highest is 50, and the median is 5. According to AirNow.gov, AQI values above 100 indicate unhealthy air quality, at first for certain sensitive groups. Because the highest value in this data is 50, every observation falls within the range that is generally considered satisfactory.

Additionally, an AQI of 100 for carbon monoxide corresponds to 9 parts per million.
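
One way to back up this summary is to compare AQI values against the threshold of 100 quoted from AirNow.gov. The sketch below is only an illustration: the Series holds the minimum, median, 75th percentile, and maximum reported earlier in this notebook, and the name `SATISFACTORY_MAX_AQI` is an assumption introduced here.

```python
import pandas as pd

# Threshold from the AirNow.gov guidance quoted above:
# AQI values at or below 100 are generally thought of as satisfactory.
SATISFACTORY_MAX_AQI = 100

# Summary values reported earlier in this notebook (min, median, 75%, max).
aqi = pd.Series([0, 5, 9, 50], name="aqi")

# Boolean mask of values in the satisfactory range; its mean is the share.
is_satisfactory = aqi <= SATISFACTORY_MAX_AQI
share_satisfactory = is_satisfactory.mean()

print("share at or below 100:", share_satisfactory)
print("any value above 100:", bool((~is_satisfactory).any()))
```

Applied to the full DataFrame, the same comparison on `df['aqi']` would show whether any individual observation exceeds the threshold.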

**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[Numpy.Std — NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).