<a href="https://colab.research.google.com/github/ykim68ncstate/ST-554-Project1/blob/t2-dev/Project_1_Task_2_David_Pressley.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 - On Field Calibration of Electronic Nose for Benzene Estimation

## Introduction
Low-cost gas sensors used in urban air pollution monitoring face two key problems:

1.   *Specificity*: Available sensors cannot distinguish well between gases.
2.   *Drift*: Solid-state sensor readings drift, which makes quantitative measurements unreliable over time, due to:
     *   intrinsic physical and chemical properties
     *   environmental temperature fluctuations
     *   exposure to the environment
     *   oxidation of the sensor surface

The manufacturer specifies less than 2% drift per 6 months (De Vito et al., 2008).

## Background
This dataset comes from the UC Irvine Machine Learning Repository and contains the responses of a gas multisensor device deployed in the field in a polluted area of an Italian city. Data were captured from March 2004 to February 2005 (one year) and represent the longest freely available recordings of air quality sensor device measurements.

### Dataset Characteristics

**Collection Context**:
- **Location**: Road-level sensor in a significantly polluted urban area in Italy
- **Duration**: March 2004 to February 2005 (12 months)
- **Instances**: 9,357 hourly averaged observations in the loaded DataFrame (the dataset description cites 9,358)
- **Device**: Air Quality Chemical Multisensor Device with 5 metal oxide sensors

**Variable Types**:

| Type | Variables | Source |
|------|-----------|--------|
| Ground truth (GT) | CO(GT), NMHC(GT), C6H6(GT), NOx(GT), NO2(GT) | Co-located certified reference analyzer |
| Sensor responses | PT08.S1(CO), PT08.S2(NMHC), PT08.S3(NOx), PT08.S4(NO2), PT08.S5(O3) | Metal oxide sensor array |
| Environmental | T (temperature), RH (relative humidity), AH (absolute humidity) | Weather measurements |
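
For later column selection, the three groups in the table can be written down as Python lists. This is a minimal sketch; the constant names and the `column_group` helper are illustrative and are not part of the dataset or of `ucimlrepo`.

```python
# Column groups transcribed from the table above.
# The constant names and the helper are illustrative, not part of the dataset.
GROUND_TRUTH_COLS = ["CO(GT)", "NMHC(GT)", "C6H6(GT)", "NOx(GT)", "NO2(GT)"]
SENSOR_COLS = ["PT08.S1(CO)", "PT08.S2(NMHC)", "PT08.S3(NOx)", "PT08.S4(NO2)", "PT08.S5(O3)"]
ENVIRONMENT_COLS = ["T", "RH", "AH"]

COLUMN_GROUPS = {
    "ground_truth": GROUND_TRUTH_COLS,
    "sensor": SENSOR_COLS,
    "environment": ENVIRONMENT_COLS,
}


def column_group(name):
    """Return the group label for a column name, or None if it is not listed."""
    for label, columns in COLUMN_GROUPS.items():
        if name in columns:
            return label
    return None


print(column_group("PT08.S2(NMHC)"))  # sensor
print(column_group("Date"))           # None
```

Keeping the groups in one place makes it easy to select, for example, only the sensor responses with `df[SENSOR_COLS]`.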

**Data Quality Issues**:
- **Missing values**: Tagged with `-200`, requires handling
- **Cross-sensitivities**: Sensors respond to multiple gases, not just their target analyte
- **Concept drift**: Underlying pollutant mixture composition changes seasonally (e.g., winter heating increases NOx)
- **Sensor drift**: Baseline resistance changes over the 12-month deployment period

**Usage Restrictions**: Research purposes only; commercial use fully excluded. Citation of De Vito et al., Sens. And Act. B, Vol. 129, 2, 2008 required.






## Task 2 - Exploratory Data Analysis

### Data Import and Description

In [1]:
# import ucimlrepo
# run once per session
!pip install ucimlrepo


# import air quality dataset
from ucimlrepo import fetch_ucirepo
from pprint import pprint
from IPython.display import HTML, display, Markdown
from google.colab import data_table

# render pandas DataFrames as interactive Colab data tables
data_table.enable_dataframe_formatter()

# fetch dataset, confirm type as dictionary (can address keys as attributes with dot notation)
air_quality_raw = fetch_ucirepo(id=360)

# type is dict <class 'ucimlrepo.dotdict.dotdict'>, so get keys
print(f"Raw Data Type: {type(air_quality_raw)}")

#keys
raw_keys = [key for key in air_quality_raw.keys()]
display(f"Raw Data Dictionary Keys: {raw_keys}")

# Loop through raw keys
for key in raw_keys:
    value = air_quality_raw[key]
    print(f"Air Quality Raw Keys: {key} : Types: {type(value).__name__}")

    if hasattr(air_quality_raw[key], 'keys'):
        print(f"Keys in `{key}`: {air_quality_raw[key].keys()}")

    # Report the shape when present. dotdict resolves unknown attributes to None,
    # so hasattr() is also True for the nested dictionaries and their shape prints as None.
    if hasattr(value, 'shape'):
        print(f"{key.title()} Shape: {value.shape}")
    else:
        print(f"{key.title()} (type: {type(value).__name__}) has no shape attribute.")

if hasattr(air_quality_raw['data'], 'keys'):
    print(f"Data Keys: {air_quality_raw['data'].keys()}")
    # The 'data' container itself is a dictionary and has no shape

if hasattr(air_quality_raw['metadata'], 'keys'):
    print(f"Metadata Keys: {air_quality_raw['metadata'].keys()}")

if hasattr(air_quality_raw['variables'], 'keys'):
    print(f"Variable Keys: {air_quality_raw['variables'].keys()}")


# variable information, rendered as an HTML table
display(HTML(air_quality_raw.variables.to_html()))
# the same table as markdown
display(Markdown(air_quality_raw.variables.to_markdown()))

# features and targets are exposed as attributes of `data`;
# features is already a DataFrame, and targets is None for this dataset
raw_features = air_quality_raw.data.features
raw_targets = air_quality_raw.data.targets

def print_structure(d, indent=0):
    """Recursively print the structure of a dictionary or dotdict."""
    for key, value in d.items():
        prefix = '  ' * indent
        print(f"{prefix}- {key}: {type(value).__name__}")

        # if it's a dict, recurse
        if hasattr(value, 'keys') and not hasattr(value, 'shape'): # Avoid recursing into DataFrames
             print_structure(value, indent + 1)
        # dataframes have shape.
        #TODO: determine other summary stats
        elif hasattr(value, 'shape'):
             print(f"{prefix}  Shape: {value.shape}")

print("Structure of air_quality_raw:")
print_structure(air_quality_raw)

# metadata
# pprint(air_quality_raw.metadata)

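# missing values? `missing` is not a key of the dotdict, so this prints None;
# the metadata keys `has_missing_values` and `missing_values_symbol` carry that information
print(f"Missing Values: {air_quality_raw.missing}")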


# air_quality_raw.shape is None. check the dataframe shape instead
print(f"Dataset Shape: {raw_features.shape}")

print("\nInformation about raw_features DataFrame:")
raw_features.info()

print("\nFirst 5 rows of raw_features DataFrame:")
display(raw_features.head())


Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Raw Data Type: <class 'ucimlrepo.dotdict.dotdict'>


"Raw Data Dictionary Keys: ['data', 'metadata', 'variables']"

Air Quality Raw Keys: data : Types: dotdict
Keys in `data`: dict_keys(['ids', 'features', 'targets', 'original', 'headers'])
Data Shape: None
Air Quality Raw Keys: metadata : Types: dotdict
Keys in `metadata`: dict_keys(['uci_id', 'name', 'repository_url', 'data_url', 'abstract', 'area', 'tasks', 'characteristics', 'num_instances', 'num_features', 'feature_types', 'demographics', 'target_col', 'index_col', 'has_missing_values', 'missing_values_symbol', 'year_of_dataset_creation', 'last_updated', 'dataset_doi', 'creators', 'intro_paper', 'additional_info'])
Metadata Shape: None
Air Quality Raw Keys: variables : Types: DataFrame
Keys in `variables`: Index(['name', 'role', 'type', 'demographic', 'description', 'units',
       'missing_values'],
      dtype='object')
Variables Shape: (15, 7)
Data Keys: dict_keys(['ids', 'features', 'targets', 'original', 'headers'])
Metadata Keys: dict_keys(['uci_id', 'name', 'repository_url', 'data_url', 'abstract', 'area', 'tasks', 'characteristics', 'nu

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,Date,Feature,Date,,,,no
1,Time,Feature,Categorical,,,,no
2,CO(GT),Feature,Integer,,True hourly averaged concentration CO in mg/m^3 (reference analyzer),mg/m^3,no
3,PT08.S1(CO),Feature,Categorical,,hourly averaged sensor response (nominally CO targeted),,no
4,NMHC(GT),Feature,Integer,,True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer),microg/m^3,no
5,C6H6(GT),Feature,Continuous,,True hourly averaged Benzene concentration in microg/m^3 (reference analyzer),microg/m^3,no
6,PT08.S2(NMHC),Feature,Categorical,,hourly averaged sensor response (nominally NMHC targeted),,no
7,NOx(GT),Feature,Integer,,True hourly averaged NOx concentration in ppb (reference analyzer),ppb,no
8,PT08.S3(NOx),Feature,Categorical,,hourly averaged sensor response (nominally NOx targeted),,no
9,NO2(GT),Feature,Integer,,True hourly averaged NO2 concentration in microg/m^3 (reference analyzer),microg/m^3,no


|    | name          | role    | type        | demographic   | description                                                                                            | units      | missing_values   |
|---:|:--------------|:--------|:------------|:--------------|:-------------------------------------------------------------------------------------------------------|:-----------|:-----------------|
|  0 | Date          | Feature | Date        |               |                                                                                                        |            | no               |
|  1 | Time          | Feature | Categorical |               |                                                                                                        |            | no               |
|  2 | CO(GT)        | Feature | Integer     |               | True hourly averaged concentration CO in mg/m^3  (reference analyzer)                                  | mg/m^3     | no               |
|  3 | PT08.S1(CO)   | Feature | Categorical |               | hourly averaged sensor response (nominally  CO targeted)                                               |            | no               |
|  4 | NMHC(GT)      | Feature | Integer     |               | True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer) | microg/m^3 | no               |
|  5 | C6H6(GT)      | Feature | Continuous  |               | True hourly averaged Benzene concentration  in microg/m^3 (reference analyzer)                         | microg/m^3 | no               |
|  6 | PT08.S2(NMHC) | Feature | Categorical |               | hourly averaged sensor response (nominally NMHC targeted)                                              |            | no               |
|  7 | NOx(GT)       | Feature | Integer     |               | True hourly averaged NOx concentration  in ppb (reference analyzer)                                    | ppb        | no               |
|  8 | PT08.S3(NOx)  | Feature | Categorical |               | hourly averaged sensor response (nominally NOx targeted)                                               |            | no               |
|  9 | NO2(GT)       | Feature | Integer     |               | True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)                              | microg/m^3 | no               |
| 10 | PT08.S4(NO2)  | Feature | Categorical |               | hourly averaged sensor response (nominally NO2 targeted)                                               |            | no               |
| 11 | PT08.S5(O3)   | Feature | Categorical |               | hourly averaged sensor response (nominally O3 targeted)                                                |            | no               |
| 12 | T             | Feature | Continuous  |               | Temperature                                                                                            | °C         | no               |
| 13 | RH            | Feature | Continuous  |               | Relative Humidity                                                                                      | %          | no               |
| 14 | AH            | Feature | Continuous  |               | Absolute Humidity                                                                                      |            | no               |

Structure of air_quality_raw:
- data: dotdict
  Shape: None
- metadata: dotdict
  Shape: None
- variables: DataFrame
  Shape: (15, 7)
Missing Values: None
Dataset Shape: (9357, 15)

Information about raw_features DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   int64  
 4   NMHC(GT)       9357 non-null   int64  
 5   C6H6(GT)       9357 non-null   float64
 6   PT08.S2(NMHC)  9357 non-null   int64  
 7   NOx(GT)        9357 non-null   int64  
 8   PT08.S3(NOx)   9357 non-null   int64  
 9   NO2(GT)        9357 non-null   int64  
 10  PT08.S4(NO2)   9357 non-null   int64  
 11  PT08.S5(O3)    9357 non-null   int64  
 12  T              9357 non-null   float64
 13  RH             9357 non-null   float64
 14  AH             9357 non-null   float64
dtypes: float64(5), int64(8), object(2)
memory usage: 1.1+ MB

First 5 rows of raw_features DataFrame:

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,3/10/2004,18:00:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578
1,3/10/2004,19:00:00,2.0,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255
2,3/10/2004,20:00:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502
3,3/10/2004,21:00:00,2.2,1376,80,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1.6,1272,51,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888


| name          | role    | type        | units      |
|:--------------|:--------|:------------|:-----------|
| Date          | Feature | Date        |            |
| Time          | Feature | Categorical |            |
| CO(GT)        | Feature | Integer     | mg/m^3     |
| PT08.S1(CO)   | Feature | Categorical |            |
| NMHC(GT)      | Feature | Integer     | microg/m^3 |
| C6H6(GT)      | Feature | Continuous  | microg/m^3 |
| PT08.S2(NMHC) | Feature | Categorical |            |
| NOx(GT)       | Feature | Integer     | ppb        |
| PT08.S3(NOx)  | Feature | Categorical |            |
| NO2(GT)       | Feature | Integer     | microg/m^3 |
| PT08.S4(NO2)  | Feature | Categorical |            |
| PT08.S5(O3)   | Feature | Categorical |            |
| T             | Feature | Continuous  | °C         |
| RH            | Feature | Continuous  | %          |
| AH            | Feature | Continuous  |            |



In [2]:
# Step 1: Check the type of each value in the top-level keys
print("Types of top-level values:")
for key in air_quality_raw.keys():
    print(f"  {key}: {type(air_quality_raw[key])}")

# Step 2: 'data' is itself a container, so list its keys
if hasattr(air_quality_raw['data'], 'keys'):
    print(f"\nKeys inside 'data': {air_quality_raw['data'].keys()}")

Types of top-level values:
  data: <class 'ucimlrepo.dotdict.dotdict'>
  metadata: <class 'ucimlrepo.dotdict.dotdict'>
  variables: <class 'pandas.core.frame.DataFrame'>

Keys inside 'data': dict_keys(['ids', 'features', 'targets', 'original', 'headers'])



In [82]:
def print_structure(d, indent=0):
    """Recursively print the structure of a dictionary or dotdict."""
    for key, value in d.items():
        prefix = '  ' * indent
        print(f"{prefix}- {key}: {type(value).__name__}")

        # If it's a dictionary (or dotdict), recurse
        if hasattr(value, 'keys') and not hasattr(value, 'shape'): # Avoid recursing into DataFrames
             print_structure(value, indent + 1)
        # If it has a shape (like DataFrame/Array), print it
        elif hasattr(value, 'shape'):
             print(f"{prefix}  Shape: {value.shape}")

print("Structure of air_quality_raw:")
print_structure(air_quality_raw)

Structure of air_quality_raw:
- data: dotdict
  Shape: None
- metadata: dotdict
  Shape: None
- variables: DataFrame
  Shape: (15, 7)
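
In the recorded output, the nested `data` and `metadata` entries print `Shape: None` and are never expanded. That pattern is consistent with `dotdict` resolving unknown attributes to `None`, which makes `hasattr(value, 'shape')` true for any dotdict. Below is a minimal sketch of a variant that checks types instead. `DotDict` is a stand-in defined here on that assumption, and `describe_structure` is a hypothetical name.

```python
import pandas as pd


class DotDict(dict):
    """Stand-in for ucimlrepo's dotdict (assumed: attribute access falls back to dict.get)."""
    __getattr__ = dict.get


def describe_structure(d, indent=0):
    """Return lines describing a nested mapping, recursing into dict subclasses only."""
    lines = []
    prefix = "  " * indent
    for key, value in d.items():
        lines.append(f"{prefix}- {key}: {type(value).__name__}")
        if isinstance(value, dict):
            # dict and its subclasses are containers: descend one level
            lines.extend(describe_structure(value, indent + 1))
        elif isinstance(value, (pd.DataFrame, pd.Series)):
            # only pandas objects report a shape
            lines.append(f"{prefix}  Shape: {value.shape}")
    return lines


# Small stand-in object that mimics the nesting seen in the output above
demo = DotDict(
    data=DotDict(features=pd.DataFrame({"a": [1, 2]}), targets=None),
    variables=pd.DataFrame({"name": ["a"]}),
)

print(hasattr(demo, "shape"))  # True, although the value is None
print("\n".join(describe_structure(demo)))
```

With `isinstance` checks, the nested `features` entry is reached and its shape is reported, while `targets` is listed as `NoneType`.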



### Handling Missing Values

The `raw_features` DataFrame holds all 15 columns of the Air Quality dataset. Missing readings are encoded with the placeholder `-200`, which will be replaced with `NaN` so that pandas treats them as missing.

Before replacing anything, inspect the DataFrame:

*   the first few rows
*   the column summary
*   the data types



In [None]:
print("\nInformation about raw_features DataFrame:")
raw_features.info()

print("\nFirst 5 rows of raw_features DataFrame:")
display(raw_features.head())


Information about raw_features DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   int64  
 4   NMHC(GT)       9357 non-null   int64  
 5   C6H6(GT)       9357 non-null   float64
 6   PT08.S2(NMHC)  9357 non-null   int64  
 7   NOx(GT)        9357 non-null   int64  
 8   PT08.S3(NOx)   9357 non-null   int64  
 9   NO2(GT)        9357 non-null   int64  
 10  PT08.S4(NO2)   9357 non-null   int64  
 11  PT08.S5(O3)    9357 non-null   int64  
 12  T              9357 non-null   float64
 13  RH             9357 non-null   float64
 14  AH             9357 non-null   float64
dtypes: float64(5), int64(8), object(2)
memory usage: 1.1+ MB

First 5 rows of raw_features DataFrame:

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,3/10/2004,18:00:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578
1,3/10/2004,19:00:00,2.0,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255
2,3/10/2004,20:00:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502
3,3/10/2004,21:00:00,2.2,1376,80,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1.6,1272,51,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888



In [8]:
print(type(raw_features))
print(type(raw_targets))

<class 'pandas.core.frame.DataFrame'>
<class 'NoneType'>
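
`raw_targets` is `None` because every variable in this dataset is listed with the role `Feature`. Since the project estimates benzene, one option is to split `C6H6(GT)` off as the target by hand. This is a minimal sketch on a three-row stand-in frame. Choosing `C6H6(GT)` is an assumption drawn from the project title, and `split_features_target` is a hypothetical helper.

```python
import pandas as pd

TARGET_COL = "C6H6(GT)"  # assumed target: the benzene reference column


def split_features_target(df, target_col=TARGET_COL):
    """Return (X, y), where y is the target column and X is every other column."""
    if target_col not in df.columns:
        raise KeyError(f"{target_col!r} not found in DataFrame columns")
    y = df[target_col]
    X = df.drop(columns=[target_col])
    return X, y


# Stand-in for raw_features, using the first three rows shown above
demo = pd.DataFrame({
    "C6H6(GT)": [11.9, 9.4, 9.0],
    "PT08.S2(NMHC)": [1046, 955, 939],
    "T": [13.6, 13.3, 11.9],
})

X, y = split_features_target(demo)
print(list(X.columns))
print(y.name)
```

On the real data the call would be `split_features_target(raw_features)`.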



### Data Cleaning

In [4]:
# Check for standard NaN missing values
print("\nMissing values (NaN) in raw_features:")
print(raw_features.isnull().sum())

# Check for missing values represented by -200 (this dataset's missing-value tag)
missing_200 = (raw_features == -200).sum()
print("\nMissing values (-200) in raw_features:")
print(missing_200)


Missing values (NaN) in raw_features:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Missing values (-200) in raw_features:
Date                0
Time                0
CO(GT)           1683
PT08.S1(CO)       366
NMHC(GT)         8443
C6H6(GT)          366
PT08.S2(NMHC)     366
NOx(GT)          1639
PT08.S3(NOx)      366
NO2(GT)          1642
PT08.S4(NO2)      366
PT08.S5(O3)       366
T                 366
RH                366
AH                366
dtype: int64
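
The counts above show that `isnull()` finds nothing because the gaps are stored as `-200`. Below is a minimal sketch of the replacement step on a small stand-in frame. `mask_sentinel` is a hypothetical helper, and the demo rows are made up for illustration rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

MISSING_SENTINEL = -200  # missing-value tag used by this dataset


def mask_sentinel(df, sentinel=MISSING_SENTINEL):
    """Return a copy of df with the sentinel replaced by NaN in numeric columns only."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # integer columns become float64 here, because NaN has no integer representation
    out[numeric_cols] = out[numeric_cols].replace(sentinel, np.nan)
    return out


# Illustrative stand-in for raw_features (values are not real observations)
demo = pd.DataFrame({
    "Date": ["3/10/2004", "3/10/2004", "3/10/2004"],
    "CO(GT)": [2.6, -200.0, 2.2],
    "NMHC(GT)": [150, -200, -200],
})

cleaned = mask_sentinel(demo)

# Share of missing values per column, now visible to pandas
print(cleaned.isna().mean())
```

Working on a copy leaves the raw frame untouched, so the `-200` counts can still be reproduced later. On the real data the call would be `mask_sentinel(raw_features)`.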

