
Title: Task 1

Author: Bryan Sandor

# Initialization

The data set is distributed through the UCI Machine Learning Repository, so the `ucimlrepo` package must be installed before the data can be fetched and the necessary variables saved locally.

In [None]:
!pip install ucimlrepo

import pandas as pd
import math

from ucimlrepo import fetch_ucirepo

# fetch dataset
air_quality = fetch_ucirepo(id=360)

# data (as pandas dataframes)
airdata = air_quality.data.features
y = air_quality.data.targets

# metadata
print(air_quality.metadata)

# variable information
print(air_quality.variables)

# look at a few of the observations
airdata

{'uci_id': 360, 'name': 'Air Quality', 'repository_url': 'https://archive.ics.uci.edu/dataset/360/air+quality', 'data_url': 'https://archive.ics.uci.edu/static/public/360/data.csv', 'abstract': 'Contains the responses of a gas multisensor device deployed on the field in an Italian city. Hourly responses averages are recorded along with gas concentrations references from a certified analyzer. ', 'area': 'Computer Science', 'tasks': ['Regression'], 'characteristics': ['Multivariate', 'Time-Series'], 'num_instances': 9358, 'num_features': 15, 'feature_types': ['Real'], 'demographics': [], 'target_col': None, 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2008, 'last_updated': 'Sun Mar 10 2024', 'dataset_doi': '10.24432/C59K5F', 'creators': ['Saverio Vito'], 'intro_paper': {'ID': 420, 'type': 'NATIVE', 'title': 'On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario', 'authors': 

| | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3/10/2004 | 18:00:00 | 2.6 | 1360 | 150 | 11.9 | 1046 | 166 | 1056 | 113 | 1692 | 1268 | 13.6 | 48.9 | 0.7578 |
| 1 | 3/10/2004 | 19:00:00 | 2.0 | 1292 | 112 | 9.4 | 955 | 103 | 1174 | 92 | 1559 | 972 | 13.3 | 47.7 | 0.7255 |
| 2 | 3/10/2004 | 20:00:00 | 2.2 | 1402 | 88 | 9.0 | 939 | 131 | 1140 | 114 | 1555 | 1074 | 11.9 | 54.0 | 0.7502 |
| 3 | 3/10/2004 | 21:00:00 | 2.2 | 1376 | 80 | 9.2 | 948 | 172 | 1092 | 122 | 1584 | 1203 | 11.0 | 60.0 | 0.7867 |
| 4 | 3/10/2004 | 22:00:00 | 1.6 | 1272 | 51 | 6.5 | 836 | 131 | 1205 | 116 | 1490 | 1110 | 11.2 | 59.6 | 0.7888 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9352 | 4/4/2005 | 10:00:00 | 3.1 | 1314 | -200 | 13.5 | 1101 | 472 | 539 | 190 | 1374 | 1729 | 21.9 | 29.3 | 0.7568 |
| 9353 | 4/4/2005 | 11:00:00 | 2.4 | 1163 | -200 | 11.4 | 1027 | 353 | 604 | 179 | 1264 | 1269 | 24.3 | 23.7 | 0.7119 |
| 9354 | 4/4/2005 | 12:00:00 | 2.4 | 1142 | -200 | 12.4 | 1063 | 293 | 603 | 175 | 1241 | 1092 | 26.9 | 18.3 | 0.6406 |
| 9355 | 4/4/2005 | 13:00:00 | 2.1 | 1003 | -200 | 9.5 | 961 | 235 | 702 | 156 | 1041 | 770 | 28.3 | 13.5 | 0.5139 |


Now we delete any observation where `C6H6(GT)` or `PT08.S1(CO)` is $-200$, the value that marks a missing measurement.

In [None]:
sum(airdata["C6H6(GT)"] == -200) # count number of observations to delete

366

In [None]:
missing = airdata.index[airdata["C6H6(GT)"] == -200] # index labels of the rows marked as missing
airdata.drop(missing, inplace = True) # inplace updates dataframe actively

In [None]:
sum(airdata["C6H6(GT)"] == -200)

0

Removing the rows with a missing `C6H6(GT)` value also removed every row with a missing `PT08.S1(CO)` value: the count of $-200$ entries in the latter column is now $0$ as well. A `len` command shows that $8991$ observations remain after the $366$ removals.

In [None]:
sum(airdata["PT08.S1(CO)"] == -200)

0

In [None]:
len(airdata)

8991
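The row removal above can also be expressed as a single boolean-mask filter. The following is a minimal sketch on a small made-up dataframe; the name `toy` and its values are illustrative assumptions, not taken from the data set.

```python
import pandas as pd

# Toy stand-in for airdata; the values are invented for illustration only.
toy = pd.DataFrame({
    "C6H6(GT)": [11.9, -200.0, 9.0, -200.0, 6.5],
    "PT08.S1(CO)": [1360, -200, 1402, -200, 1272],
})

missing = toy["C6H6(GT)"] == -200   # True where the sentinel marks a missing value
n_missing = int(missing.sum())      # number of rows that will be removed
cleaned = toy[~missing]             # keep only the rows without the sentinel

print(n_missing, len(cleaned))
```

Filtering this way returns a new dataframe rather than modifying the original in place, and it keeps the original index labels of the surviving rows.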

# Grid Search

## Just `y`

In [271]:
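def loss(y = "C6H6(GT)", c = 0):

    """
    This function takes a column from the airdata data frame and a fixed value c and returns the square root of the sum of squared differences between the observations in that column and c. This equals the root mean square error multiplied by the square root of the number of observations, so both are minimized by the same c.
    """

    SSE = 0 # initialize sum of squared errors (SSE)

    for i in range(len(airdata)):
        SSE += ( airdata[y].iloc[i] - c ) ** 2 # accumulate squared error

    rootSSE = math.sqrt(SSE) # square root of SSE (not divided by the number of observations)

    return rootSSE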
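For comparison, the quantity computed by `loss` (the square root of the summed squared deviations) can be obtained without an explicit loop. The sketch below uses a hypothetical helper `root_sse` and a made-up series, neither of which is part of the original notebook. It also shows that dividing by $\sqrt{n}$ turns the value into a root mean square error.

```python
import math
import pandas as pd

def root_sse(values, c):
    """Square root of the sum of squared deviations of values from c."""
    return math.sqrt(((values - c) ** 2).sum())

values = pd.Series([1.0, 2.0, 3.0, 4.0])  # made-up data
c = 2.5

total = root_sse(values, c)               # sqrt(2.25 + 0.25 + 0.25 + 2.25) = sqrt(5)
rmse = total / math.sqrt(len(values))     # sqrt(5 / 4)

print(total, rmse)
```

Since the two measures differ only by the constant factor $\sqrt{n}$, they are minimized by the same value of `c`.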

In [287]:
def makegrid(y = "C6H6(GT)", n = 1):

    """
    This function creates a list of n evenly spaced grid values that starts at the first quartile and stops one step short of the third quartile. The step size is the interquartile range divided by the number of steps, n.
    """

    q1 = airdata[y].quantile(0.25) # first quartile
    q3 = airdata[y].quantile(0.75) # third quartile
    mesh = (q3 - q1)/n # step size
    grid = [x * mesh + q1 for x in range(n)]
    return grid
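The same kind of grid can be sketched with `numpy.linspace`, assuming NumPy is available. The quartile values and step count below are placeholders chosen for illustration. Passing `endpoint=False` matches `makegrid`, which starts at the first quartile and stops one step short of the third.

```python
import numpy as np

q1, q3, n = 4.0, 14.0, 5                           # placeholder quartiles and step count
mesh = (q3 - q1) / n                               # step size, as in makegrid
grid = np.linspace(q1, q3, num=n, endpoint=False)  # n points: q1, q1 + mesh, ..., q3 - mesh

print(grid)
```

With `endpoint=True` and `num=n + 1`, the grid would include the third quartile as well.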

In [364]:
def gridsearch(y = "C6H6(GT)", n = 1):
    """
    This function takes a column from the airdata, y, and a desired number of steps, n, creates a grid and runs the loss function on each value within the grid to create a list. It then searches the list for the minimum loss and the grid value where it occurs, returning both as [grid value, loss].
    """

    if n > 1000: # Program takes approximately 1 minute to run with this cap
        print("Please choose a smaller number of iterations.")
        return

    grid = makegrid(y = y, n = n) # create a grid with given parameters

    # pair each grid value with its loss
    lossgrid = [[c, loss(y = y, c = c)] for c in grid]

    # find the pair with the smallest loss (the first one in case of ties)
    gridref, gridminloss = min(lossgrid, key = lambda pair: pair[1])

    return [gridref, gridminloss]
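The search can be illustrated end to end with a self-contained sketch on made-up data. The function name `grid_search`, the use of NumPy, and the sample values are all assumptions made for this example and are not part of the notebook above.

```python
import numpy as np

def grid_search(values, n):
    """Return (best_c, best_loss) over n grid points from Q1 (inclusive) to Q3 (exclusive)."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    grid = np.linspace(q1, q3, num=n, endpoint=False)
    # One loss per grid value: square root of the summed squared deviations.
    losses = np.sqrt(((values[:, None] - grid[None, :]) ** 2).sum(axis=0))
    best = int(np.argmin(losses))  # index of the first smallest loss
    return grid[best], losses[best]

values = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0])  # made-up data
best_c, best_loss = grid_search(values, n=1000)

print(best_c, values.mean())
```

Because the grid only covers the interquartile range, a search of this kind can land near the sample mean only when the mean lies between the two quartiles.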

In [361]:
gridsearch(n = 1000)

[np.float64(10.0832), 706.3592030828498]

In [362]:
airdata["C6H6(GT)"].mean()

np.float64(10.083105327549772)

In [363]:
loss(c = airdata["C6H6(GT)"].mean())

706.3592030258148

In [366]:
gridsearch(y = "PT08.S1(CO)", n = 1000)

[np.float64(1099.876), 20582.576666822246]

In [367]:
airdata["PT08.S1(CO)"].mean()

np.float64(1099.8331664998332)

In [368]:
loss(y = "PT08.S1(CO)", c = airdata["PT08.S1(CO)"].mean())

20582.576266098204
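The close agreement between each grid-search result and the corresponding sample mean is expected. As a short derivation, writing $y_1, \dots, y_n$ for the observations (notation introduced here, not used above), the sum of squared deviations is minimized where its derivative with respect to $c$ vanishes:

$$
\frac{d}{dc} \sum_{i=1}^{n} (y_i - c)^2 = -2 \sum_{i=1}^{n} (y_i - c) = 0 \quad\Longrightarrow\quad c = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
$$

Because the square root is an increasing function, the value returned by `loss` is minimized at the same $c$. When $\bar{y}$ lies between the quartiles, the grid can only land within one step of it, which would account for the small gap between the loss at the grid minimum and the loss at the mean.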

## Using `y` and another numeric variable `x`

# Gradient Descent

## Just `y`

## Using `y` and another numeric variable `x`