<a href="https://colab.research.google.com/github/shuodeng521-sys/ST-554-Project1-Shuo-Anna-Jillian/blob/main/Task2/Greene_Project1_Task2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

Monitoring carcinogenic atmospheric pollutants is essential for mitigating public health crises, but current sensing technologies are expensive and labor-intensive. [De Vito et al. (2008)](https://doi.org/10.1016/j.snb.2007.09.060) propose a low-cost gas multi-sensor device capable of providing high-resolution data for CO, NMHCs, NOx, NO2, and O3, calibrated with a machine learning approach. If successful, this type of setup could give city, state, and federal planners and managers critical information on how to protect human health. In this project, our team explores the data to find the best way to set up a model and analyze covariates for the sensor. The data can be downloaded from the [UCI ML Repository](https://archive.ics.uci.edu/dataset/360/air+quality) or read in with the code below.

In [11]:
# Install UCI package - only needs to be done 1st time
# !pip install ucimlrepo

# Import the package
from ucimlrepo import fetch_ucirepo

# Fetch the Air Quality Multisensor dataset
air_quality = fetch_ucirepo(id=360)

# Print var info
air_quality.variables

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | Date | Feature | Date | | | | no |
| 1 | Time | Feature | Categorical | | | | no |
| 2 | CO(GT) | Feature | Integer | | True hourly averaged concentration CO in mg/m^... | mg/m^3 | no |
| 3 | PT08.S1(CO) | Feature | Categorical | | hourly averaged sensor response (nominally CO... | | no |
| 4 | NMHC(GT) | Feature | Integer | | True hourly averaged overall Non Metanic Hydro... | microg/m^3 | no |
| 5 | C6H6(GT) | Feature | Continuous | | True hourly averaged Benzene concentration in... | microg/m^3 | no |
| 6 | PT08.S2(NMHC) | Feature | Categorical | | hourly averaged sensor response (nominally NMH... | | no |
| 7 | NOx(GT) | Feature | Integer | | True hourly averaged NOx concentration in ppb... | ppb | no |
| 8 | PT08.S3(NOx) | Feature | Categorical | | hourly averaged sensor response (nominally NOx... | | no |
| 9 | NO2(GT) | Feature | Integer | | True hourly averaged NO2 concentration in micr... | microg/m^3 | no |


In [14]:
# Import auxiliary packages for Task 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [12]:
# Convert to pandas df
air_quality = pd.DataFrame(air_quality.data.features)

# Rename columns: "True_*" are the true (GT) concentrations, "PT_*" are the sensor responses
# Note: "PT_NO3" holds the PT08.S5(O3) ozone sensor response
air_quality.rename(columns = {"CO(GT)" : "True_CO", "PT08.S1(CO)" : "PT_CO", "NMHC(GT)" : "True_NMHC",
                              "C6H6(GT)" : "True_C6H6", "PT08.S2(NMHC)" : "PT_NMHC", "NOx(GT)" : "True_NOx",
                              "PT08.S3(NOx)" : "PT_NOx", "NO2(GT)" : "True_NO2", "PT08.S4(NO2)" : "PT_NO2",
                              "PT08.S5(O3)" : "PT_NO3"}, inplace = True)

air_quality.head()

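| | Date | Time | True_CO | PT_CO | True_NMHC | True_C6H6 | PT_NMHC | True_NOx | PT_NOx | True_NO2 | PT_NO2 | PT_NO3 | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3/10/2004 | 18:00:00 | 2.6 | 1360 | 150 | 11.9 | 1046 | 166 | 1056 | 113 | 1692 | 1268 | 13.6 | 48.9 | 0.7578 |
| 1 | 3/10/2004 | 19:00:00 | 2.0 | 1292 | 112 | 9.4 | 955 | 103 | 1174 | 92 | 1559 | 972 | 13.3 | 47.7 | 0.7255 |
| 2 | 3/10/2004 | 20:00:00 | 2.2 | 1402 | 88 | 9.0 | 939 | 131 | 1140 | 114 | 1555 | 1074 | 11.9 | 54.0 | 0.7502 |
| 3 | 3/10/2004 | 21:00:00 | 2.2 | 1376 | 80 | 9.2 | 948 | 172 | 1092 | 122 | 1584 | 1203 | 11.0 | 60.0 | 0.7867 |
| 4 | 3/10/2004 | 22:00:00 | 1.6 | 1272 | 51 | 6.5 | 836 | 131 | 1205 | 116 | 1490 | 1110 | 11.2 | 59.6 | 0.7888 |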


In [16]:
# Check for NAs
# The metadata codes missing values as -200, so convert them to NaN first
air_quality.replace(-200, np.nan, inplace = True)

air_quality.isna().sum()

| Column | NA count |
|---|---|
| Date | 0 |
| Time | 0 |
| True_CO | 1683 |
| PT_CO | 366 |
| True_NMHC | 8443 |
| True_C6H6 | 366 |
| PT_NMHC | 366 |
| True_NOx | 1639 |
| PT_NOx | 366 |
| True_NO2 | 1642 |
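
The sentinel replacement can be tried on a small made-up frame. This is only a sketch: the values are invented for illustration, the names `toy`, `toy_clean`, and `na_counts` are hypothetical, and only the column names come from the renamed data. It shows that `replace` turns every -200 into `NaN` and that `isna().sum()` then counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Toy data (invented values); -200 marks a missing reading, as in the metadata
toy = pd.DataFrame({
    "True_CO": [2.6, -200, 2.2, -200],
    "PT_CO": [1360, 1292, -200, 1376],
    "T": [13.6, 13.3, -200, 11.0],
})

# Non-mutating form: returns a new frame instead of using inplace=True
toy_clean = toy.replace(-200, np.nan)

# Count missing values per column
na_counts = toy_clean.isna().sum()
print(na_counts)
```

Returning a new frame leaves the raw values available for comparison, whereas `inplace = True` (as used in the cell above) overwrites them.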


Many rows contain NAs, particularly in the 'True' columns, which is common in ML datasets and is part of why we need a model! For this exploratory analysis, removing every row with an NA would be overzealous. Instead, I will remove rows with NAs in the sensor columns: each sensor column shown has the same number of NAs (366), which suggests that the same rows are missing across all of them.
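
Equal NA counts do not by themselves prove that the same rows are missing, so a direct check can help. The sketch below uses a hypothetical helper, `sensors_missing_together`, and invented toy values; it tests whether every row that is missing in any of the given columns is missing in all of them.

```python
import numpy as np
import pandas as pd

def sensors_missing_together(df, cols):
    """Return True if every row with an NA in any of `cols` has NAs in all of them."""
    na = df[cols].isna()
    return bool((na.any(axis=1) == na.all(axis=1)).all())

# Toy data (invented values) using the renamed column names
toy = pd.DataFrame({
    "PT_CO": [1360.0, np.nan, 1402.0, np.nan],
    "PT_NMHC": [1046.0, np.nan, 939.0, np.nan],
    "True_CO": [2.6, 2.0, np.nan, np.nan],
})

# The two sensor columns share their missing rows; PT_CO and True_CO do not
together = sensors_missing_together(toy, ["PT_CO", "PT_NMHC"])
mixed = sensors_missing_together(toy, ["PT_CO", "True_CO"])
print(together, mixed)
```

On the real data, the same helper could be called with the list of sensor and climate column names before deciding which single column to pass to `dropna`.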

In [17]:
# Drop rows with NAs in one sensor column (PT_CO), then recount NAs to see whether that clears the other sensor columns
air_quality.dropna(subset = ["PT_CO"], inplace = True)
air_quality.isna().sum()

| Column | NA count |
|---|---|
| Date | 0 |
| Time | 0 |
| True_CO | 1647 |
| PT_CO | 0 |
| True_NMHC | 8104 |
| True_C6H6 | 0 |
| PT_NMHC | 0 |
| True_NOx | 1595 |
| PT_NOx | 0 |
| True_NO2 | 1598 |


As predicted, dropping the rows with NAs in the CO sensor column removed all NAs in the sensor and climate columns. With that issue addressed, I can move on to the Task 2 analysis.
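
A programmatic check can stand in for reading the NA counts by eye. The following is a minimal sketch on invented toy values; the names `check_cols`, `remaining`, `all_clear`, and `rows_removed` are hypothetical, and the column list is only a subset chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (invented values): one row is missing across the sensor and climate columns
toy = pd.DataFrame({
    "PT_CO": [1360.0, np.nan, 1402.0],
    "PT_NOx": [1056.0, np.nan, 1140.0],
    "T": [13.6, np.nan, 11.9],
    "True_CO": [2.6, 2.0, np.nan],
})

# Drop rows where the CO sensor reading is missing
dropped = toy.dropna(subset=["PT_CO"])

# Confirm that no NAs remain in the sensor and climate columns of interest
check_cols = ["PT_CO", "PT_NOx", "T"]
remaining = dropped[check_cols].isna().sum()
all_clear = bool((remaining == 0).all())

# Report how many rows the drop removed
rows_removed = len(toy) - len(dropped)
print(all_clear, rows_removed)
```

NAs in the 'True' columns are deliberately left in place here, matching the decision above not to remove every row with a missing value.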

## Heading 1