
# Assignment 4: Plotting

In this assignment, you'll be making plots looking at air quality in New Delhi, India between 2015 and 2020. In the process, we'll review some data wrangling and data QA.

In this assignment we'll be exploring two questions:

1. How does air quality vary over time in New Delhi? Are there seasonal patterns?
2. How does air quality vary with temperature?

# Data Documentation

## Air Quality Data
* We'll look at air quality data from AirNow. (AirNow is a partnership across many US federal agencies and is the US's official source for air quality data globally.)


* The AQI data we'll look at today is from the US embassy in New Delhi, from the "Historical" tab at [this link](https://www.airnow.gov/international/us-embassies-and-consulates/#India$New_Delhi).

* The data includes hourly measures of particulate matter (PM2.5), as well as the associated [Air Quality Index](https://www.airnow.gov/aqi/aqi-basics/) (AQI) and Air Quality Index Category.


## Temperature Data
* Again this week we'll be working with Global Historical Climatology Network data from [NOAA](https://www.ncdc.noaa.gov/cdo-web/search), this time for daily air temperature in New Delhi.
* We've exported the 'Daily Summaries' data for New Delhi City in °F for 2015-2020.
* Menne, Matthew J., Imke Durre, Bryant Korzeniewski, Shelley McNeal, Kristy Thomas, Xungang Yin, Steven Anthony, Ron Ray, Russell S. Vose, Byron E. Gleason, and Tamara G. Houston (2012): Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. NOAA National Climatic Data Center. doi:10.7289/V5D21VHZ [Aug 2021]

# PART 1: SEASONALITY OF AQI IN NEW DELHI
We'll plot the hourly Air Quality Index in New Delhi over time, from 2015 to 2020.


## PART 1.1 Read in New Delhi AQI Data

Install any packages you need and read in the AQI data.

The raw CSV links are available from the airnow.gov website, so we'll read the data directly using the following code:



```
import pandas as pd

# a list of URLs with data for each year
links =['https://dosairnowdata.org/dos/historical/NewDelhi/2015/NewDelhi_PM2.5_2015_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2016/NewDelhi_PM2.5_2016_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2017/NewDelhi_PM2.5_2017_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2018/NewDelhi_PM2.5_2018_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2019/NewDelhi_PM2.5_2019_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2020/NewDelhi_PM2.5_2020_YTD.csv']

# This is a list comprehension: it calls read_csv() on every element of the list called links
# and returns a new list of dataframes
dfs = [pd.read_csv(i) for i in links]

# Concatenate all the dataframes in the list dfs into one big data frame
df = pd.concat(dfs)

# preview data
df.head()
```



In [2]:
import pandas as pd
import plotnine
from plotnine import *

In [3]:
links =['https://dosairnowdata.org/dos/historical/NewDelhi/2015/NewDelhi_PM2.5_2015_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2016/NewDelhi_PM2.5_2016_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2017/NewDelhi_PM2.5_2017_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2018/NewDelhi_PM2.5_2018_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2019/NewDelhi_PM2.5_2019_YTD.csv',
       'https://dosairnowdata.org/dos/historical/NewDelhi/2020/NewDelhi_PM2.5_2020_YTD.csv']

In [4]:
dfs = [pd.read_csv(i) for i in links]


In [5]:
df = pd.concat(dfs)
df.head()


| | Site | Parameter | Date (LT) | Year | Month | Day | Hour | NowCast Conc. | AQI | AQI Category | Raw Conc. | Conc. Unit | Duration | QC Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New Delhi | PM2.5 - Principal | 2015-01-01 01:00 AM | 2015 | 1 | 1 | 1 | -999.0 | -999 | NaN | -999.0 | UG/M3 | 1 Hr | Missing |
| 1 | New Delhi | PM2.5 - Principal | 2015-01-01 02:00 AM | 2015 | 1 | 1 | 2 | -999.0 | -999 | NaN | -999.0 | UG/M3 | 1 Hr | Missing |
| 2 | New Delhi | PM2.5 - Principal | 2015-01-01 03:00 AM | 2015 | 1 | 1 | 3 | -999.0 | -999 | NaN | -999.0 | UG/M3 | 1 Hr | Missing |
| 3 | New Delhi | PM2.5 - Principal | 2015-01-01 04:00 AM | 2015 | 1 | 1 | 4 | -999.0 | -999 | NaN | -999.0 | UG/M3 | 1 Hr | Missing |
| 4 | New Delhi | PM2.5 - Principal | 2015-01-01 05:00 AM | 2015 | 1 | 1 | 5 | -999.0 | -999 | NaN | -999.0 | UG/M3 | 1 Hr | Missing |


## PART 1.2: Examine Data

Do the following to orient to the data and make any initial cleanings:
* Display the data size and data types.
* Look at the distribution of numerical values.
* For each categorical variable, show the unique values of that field.
* Clean data values as necessary.

Note: the `QC Name` field is a Quality Control label for the `Raw Conc.` field. The `AQI` field is based on `NowCast Conc.`, which is the algorithmically calculated [NowCast](https://www.airnow.gov/aqi/aqi-basics/using-air-quality-index/#:~:text=What%20time%20frame%20it%20covers%3A%20The%20NowCast%20shows%20you%20air,such%20as%20during%20a%20wildfire.) value. The `QC Name` field does not describe the quality of the `AQI` field directly.
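
One possible way to approach the checklist above is sketched below on a tiny made-up frame. The `toy` dataframe, its values (including the `Valid` label), and the `categorical_cols` list are assumptions for illustration only; swap in `df` and whichever columns you decide are categorical.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    "Site": ["New Delhi", "New Delhi", "New Delhi"],
    "AQI Category": [None, "Unhealthy", "Hazardous"],
    "QC Name": ["Missing", "Valid", "Valid"],
    "AQI": [-999, 155, 400],
})

# Data size and data types
print(toy.shape)
print(toy.dtypes)

# Assumption: these are the columns we treat as categorical
categorical_cols = ["Site", "AQI Category", "QC Name"]

# Unique (non-missing) values for each categorical column
uniques = {col: sorted(toy[col].dropna().unique()) for col in categorical_cols}
for col, values in uniques.items():
    print(col, values)
```

Printing the values themselves (rather than only counting them) makes it easier to spot odd labels or placeholder codes that may need cleaning.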

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 51606 entries, 0 to 8782
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Site           51606 non-null  object 
 1   Parameter      51606 non-null  object 
 2   Date (LT)      51606 non-null  object 
 3   Year           51606 non-null  int64  
 4   Month          51606 non-null  int64  
 5   Day            51606 non-null  int64  
 6   Hour           51606 non-null  int64  
 7   NowCast Conc.  51606 non-null  float64
 8   AQI            51606 non-null  int64  
 9   AQI Category   48758 non-null  object 
 10  Raw Conc.      51606 non-null  float64
 11  Conc. Unit     51606 non-null  object 
 12  Duration       51606 non-null  object 
 13  QC Name        51606 non-null  object 
dtypes: float64(2), int64(5), object(7)
memory usage: 7.9+ MB


In [9]:
df.describe()

| | Year | Month | Day | Hour | NowCast Conc. | AQI | Raw Conc. |
|---|---|---|---|---|---|---|---|
| count | 51606.0 | 51606.0 | 51606.0 | 51606.0 | 51606.0 | 51606.0 | 51606.0 |
| mean | 2017.473627 | 6.583033 | 15.737104 | 11.501085 | 44.62254 | 105.411231 | 73.562623 |
| std | 1.713099 | 3.44003 | 8.82688 | 6.921447 | 272.007072 | 281.946401 | 319.284588 |
| min | 2015.0 | 1.0 | 1.0 | 0.0 | -999.0 | -999.0 | -999.0 |
| 25% | 2016.0 | 4.0 | 8.0 | 6.0 | 33.5 | 96.0 | 33.0 |
| 50% | 2017.0 | 7.0 | 16.0 | 12.0 | 63.3 | 155.0 | 64.0 |
| 75% | 2019.0 | 10.0 | 23.0 | 17.75 | 133.1 | 191.0 | 136.0 |
| max | 2021.0 | 12.0 | 31.0 | 23.0 | 1546.9 | 1191.0 | 1985.0 |


In [18]:
# Calculate the number of unique values for each specified column individually.
for col in ['Site', 'Parameter', 'Date (LT)', 'Year', 'Month', 'Day', 'Hour', 'NowCast Conc.', 'AQI', 'AQI Category', 'Raw Conc.', 'Conc. Unit', 'Duration', 'QC Name']:
    print(f"Column {col}: {df[col].nunique()} unique values")

Column Site: 1 unique values
Column Parameter: 1 unique values
Column Date (LT): 51602 unique values
Column Year: 7 unique values
Column Month: 12 unique values
Column Day: 31 unique values
Column Hour: 24 unique values
Column NowCast Conc.: 4487 unique values
Column AQI: 717 unique values
Column AQI Category: 6 unique values
Column Raw Conc.: 879 unique values
Column Conc. Unit: 1 unique values
Column Duration: 1 unique values
Column QC Name: 4 unique values


## PART 1.3: Filter for Non-missing Values

In Parts 1 and 2, we'll be focused on the `AQI` field. Filter `df` to include only records where `AQI` is not missing. Store this in a new dataframe `df_clean`.
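
The preview and summary statistics above show missing readings stored as `-999` rather than as `NaN`, so `dropna()` alone would not remove them. Here is a minimal sketch on made-up data, assuming `-999` is the only missing-value code for `AQI` (worth confirming on the real data); `toy` and `toy_clean` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    "Date (LT)": ["2015-01-01 01:00 AM", "2015-01-01 02:00 AM", "2015-01-01 03:00 AM"],
    "AQI": [-999, 155, 191],
})

# Assumption: -999 is the sentinel used for a missing AQI value
# .copy() gives an independent frame, so adding columns later is safe
toy_clean = toy[toy["AQI"] != -999].copy()

print(toy.shape, toy_clean.shape)
```

Comparing the shapes before and after the filter is a quick check that only the expected rows were dropped.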

## PART 1.4: Check Data Coverage

Get a bird's-eye view of the coverage of our data observations in `df_clean`. Specifically, check:

* Do we have duplicate records for any dates and hours? If you find duplicates, remove the duplicates to tidy the data. Assume that the second observation (from top to bottom) is the 'corrected', updated record.
* Do we have at least one observation in every day between 2015 and 2020? If not, does there appear to be any skew in where we are missing daily observations across Years and Months?
* Do we have 24 hours of data for each day where we have at least one observation? If not, does there appear to be any skew in where we are missing hourly observations, across a) years and months and b) times of day? (i.e. Check that any missing hourly data is not concentrated in certain years and months, or times of day. It's ok to check for skew in time of day independent of years and months. We just want to make sure there isn't a large pattern/bias to the missing hourly observations.)

Make a note of what you observe. (Note: There are a lot of valid approaches for these checks! Do what makes the most sense to you in order to notice any massive skews in the data.)
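
As one of many valid approaches, the checks above could be sketched like this. The `toy` frame is made up (one repeated timestamp, incomplete days), and using `Date (LT)` as the duplicate key is an assumption; on the real data you would run the same steps on `df_clean`.

```python
import pandas as pd

# Hypothetical toy data: the 02:00 AM reading appears twice
toy = pd.DataFrame({
    "Date (LT)": ["2015-01-01 01:00 AM", "2015-01-01 02:00 AM",
                  "2015-01-01 02:00 AM", "2015-01-02 01:00 AM"],
    "Year": [2015, 2015, 2015, 2015],
    "Month": [1, 1, 1, 1],
    "Day": [1, 1, 1, 2],
    "Hour": [1, 2, 2, 1],
    "AQI": [150, 160, 165, 170],
})

# Number of rows that repeat an earlier timestamp
n_dupes = toy.duplicated(subset="Date (LT)").sum()

# keep="last" retains the second record, which we assume is the corrected one
deduped = toy.drop_duplicates(subset="Date (LT)", keep="last")

# Distinct hours observed per day; fewer than 24 means an incomplete day
hours_per_day = deduped.groupby(["Year", "Month", "Day"])["Hour"].nunique()

# Distinct days observed per year and month, to eyeball gaps in coverage
days_per_month = (
    deduped.drop_duplicates(subset=["Year", "Month", "Day"])
    .groupby(["Year", "Month"])
    .size()
)

print(n_dupes)
print(hours_per_day)
print(days_per_month)
```

The same `groupby` pattern, grouped on `Hour` alone, gives a rough view of whether missing observations cluster at particular times of day.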

## PART 1.5: Create Time Series Plot
Now, we'll use `plotnine` to create a graph of `AQI` in New Delhi over time. To do this:

* You'll want to create and use a new column in `df_clean` with the `Date (LT)` field cast as a `datetime` data type - currently `Date (LT)` is a string.

  * We haven't talked about datetime data types, but they are just another data type like `str` or `int`. The advantage is that `plotnine` will recognize the data as a date and do some automatic formatting for us. To cast the date field as a datetime type, call:

    ```
    df_clean['Date'] = pd.to_datetime(df_clean['Date (LT)'])
    ```




* Create your plot showing in detail the AQI over time
* Format the chart to make it as clear and interpretable as possible.

  * Hint: If you are looking to adjust the x axis, you can use `+ scale_x_date()`. `plotnine` treats dates as something in between continuous and discrete variables.

## PART 1.6: Interpret Your Chart
What is your chart saying? Is there a pattern to air quality in New Delhi? What is a limitation of this chart/analysis? Write a few sentences.


# PART 2: RELATIONSHIP BETWEEN TEMPERATURE AND AQI
Let's explore the relationship between air temperature and air quality in New Delhi. We'll make a bar chart with the average AQI for different temperature ranges.


## PART 2.1: Read in New Delhi Temperature Data

Read the air temperature data into a dataframe and orient to it. We'll be using the average daily temperature field, `TAVG`. The data is available on GitHub at the following link:

```
https://raw.githubusercontent.com/envirodatascience/ENVS-617-Class-Data/main/noaa_new_delhi_temp_15_20.csv
```




## PART 2.2: Create Clean Daily Summary Table

Ultimately, we want a table with a single air temperature record for New Delhi for each day in the 2015 to 2020 window. (In the next part, we will merge this with the AQI data.) To do this:

* First, answer: how many weather stations do we have data for in New Delhi?

* Then, filter your data to include only the data for the station with the most data records.

* Split out the `DATE` column into separate year, month, and day columns.

* Check that we now only have one data point per day. Check the coverage of the data to make sure we have a fair sample across years and months between 2015 and 2020.
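
A rough sketch of these steps on made-up data follows. The `STATION` column name is an assumption about the export (only `DATE` and `TAVG` are named in this assignment), and the station IDs and temperatures are invented, so check the real column names first.

```python
import pandas as pd

# Hypothetical toy data; the STATION column name is an assumption
temp = pd.DataFrame({
    "STATION": ["A", "A", "A", "B"],
    "DATE": ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-01"],
    "TAVG": [55, 57, 60, 54],
})

# How many stations, and how many records does each have?
station_counts = temp["STATION"].value_counts()
top_station = station_counts.idxmax()

# Keep only the station with the most records
temp_one = temp[temp["STATION"] == top_station].copy()

# Split DATE into year, month, and day columns
dates = pd.to_datetime(temp_one["DATE"])
temp_one["Year"] = dates.dt.year
temp_one["Month"] = dates.dt.month
temp_one["Day"] = dates.dt.day

# True if no calendar day appears more than once
one_per_day = not temp_one.duplicated(subset=["Year", "Month", "Day"]).any()

print(station_counts)
print(one_per_day)
```

Naming the new columns `Year`, `Month`, and `Day` to match the AQI data is a deliberate choice here, since it makes a later join on those fields simpler.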

## PART 2.3: Join Temperature and Air Quality Data
Join together the clean hourly AQI data and the filtered daily temperature data:

* Join on year, month, and day.
* Only keep the days where you have a record in both datasets.
* The AQI data is hourly, so you will have ~24 rows per day. This is ok.
* Check the shapes of the dataframes before and after the merge.


Hint: You will need to create a key (or keys) with the same format across dataframes.
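
To illustrate, here is a minimal sketch of an inner join on shared year, month, and day keys. Both frames are made up, and the sketch assumes the key columns already have identical names and integer types in each frame.

```python
import pandas as pd

# Hypothetical hourly AQI data (several rows per day)
aqi = pd.DataFrame({
    "Year": [2015, 2015, 2015],
    "Month": [1, 1, 1],
    "Day": [1, 1, 2],
    "Hour": [1, 2, 1],
    "AQI": [150, 160, 170],
})

# Hypothetical daily temperature data (one row per day)
temp = pd.DataFrame({
    "Year": [2015, 2015],
    "Month": [1, 1],
    "Day": [1, 3],
    "TAVG": [55, 60],
})

# how="inner" keeps only days present in both frames;
# validate raises an error if temp has more than one row per day
merged = aqi.merge(temp, on=["Year", "Month", "Day"], how="inner",
                   validate="many_to_one")

print(aqi.shape, temp.shape, merged.shape)
```

In this toy case only 1 January appears in both frames, so the merged result keeps its two hourly rows, each carrying the same daily `TAVG`.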



## PART 2.4: Convert and Bucket Temperature and Aggregate Data

To compare temperature and AQI, we will look at the average hourly AQI for bucketed temperature ranges. To do this, transform the temperature field and summarize the data:

* Let's assume that this chart will be for an audience in India. Convert the temperature data to Celsius.


* Then, create a new column that buckets the temperature values. Create the following buckets (exclusive of the lower boundary, inclusive of the upper boundary):
  * \<10
  * 10-15
  * 15-20
  * 20-25
  * 25-30
  * 30-35
  * 35-40

* Then, aggregate the data by temperature bucket and calculate a) the mean AQI for that temperature range and b) the standard deviation of the AQI for that temperature range.
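
The conversion uses $C = (F - 32) \times 5/9$, and `pd.cut` with `right=True` gives intervals that exclude the lower edge and include the upper edge. The sketch below runs on made-up values; the column names `TAVG_C`, `temp_bucket`, `mean_aqi`, and `sd_aqi` are hypothetical.

```python
import pandas as pd

# Hypothetical merged data: daily TAVG in °F repeated on hourly AQI rows
toy = pd.DataFrame({
    "TAVG": [48.0, 59.0, 68.0, 86.0, 86.0],
    "AQI": [300, 250, 200, 100, 120],
})

# Fahrenheit to Celsius
toy["TAVG_C"] = (toy["TAVG"] - 32) * 5 / 9

# Bin edges: right=True means each bucket is (lower, upper]
bins = [float("-inf"), 10, 15, 20, 25, 30, 35, 40]
labels = ["<10", "10-15", "15-20", "20-25", "25-30", "30-35", "35-40"]
toy["temp_bucket"] = pd.cut(toy["TAVG_C"], bins=bins, labels=labels, right=True)

# Mean and standard deviation of AQI per bucket (only buckets that occur)
summary = (
    toy.groupby("temp_bucket", observed=True)["AQI"]
    .agg(mean_aqi="mean", sd_aqi="std")
    .reset_index()
)

print(summary)
```

Here 59 °F converts to exactly 15 °C and lands in `10-15`, which shows the inclusive upper boundary. With these edges, a value of exactly 10 °C falls in the `<10` bucket and anything above 40 °C gets no bucket, so check the range of your data.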



## PART 2.5: Plot a Bar Chart of Bucketed Temperature vs. AQI

Use `plotnine` to create a bar graph of temperature bucket vs. average hourly AQI in New Delhi:
* Include information about the AQI standard deviation.
* Fine tune the plot to make it as clear as possible.

## PART 2.6: Interpret Your Chart
What is the chart saying? Is there a relationship between air temperature and air quality in New Delhi? What is a limitation of this chart/analysis? Write a few sentences.