
Title: Task 1

Author: Bryan Sandor

# Initialization

The data are available through the UCI Machine Learning Repository, so the `ucimlrepo` package must be installed before the dataset can be fetched and the necessary variables saved locally.

In [1]:
!pip install ucimlrepo

import pandas as pd
import math

from ucimlrepo import fetch_ucirepo

# fetch dataset
air_quality = fetch_ucirepo(id=360)

# data (as pandas dataframes)
airdata = air_quality.data.features
y = air_quality.data.targets

# metadata
print(air_quality.metadata)

# variable information
print(air_quality.variables)

# look at a few of the observations
airdata

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
{'uci_id': 360, 'name': 'Air Quality', 'repository_url': 'https://archive.ics.uci.edu/dataset/360/air+quality', 'data_url': 'https://archive.ics.uci.edu/static/public/360/data.csv', 'abstract': 'Contains the responses of a gas multisensor device deployed on the field in an Italian city. Hourly responses averages are recorded along with gas concentrations references from a certified analyzer. ', 'area': 'Computer Science', 'tasks': ['Regression'], 'characteristics': ['Multivariate', 'Time-Series'], 'num_instances': 9358, 'num_features': 15, 'feature_types': ['Real'], 'demographics': [], 'target_col': None, 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2008, 'last_updated': 'Sun Mar 10 2024', 'dataset_doi': '10.2

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,3/10/2004,18:00:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578
1,3/10/2004,19:00:00,2.0,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255
2,3/10/2004,20:00:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502
3,3/10/2004,21:00:00,2.2,1376,80,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1.6,1272,51,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9352,4/4/2005,10:00:00,3.1,1314,-200,13.5,1101,472,539,190,1374,1729,21.9,29.3,0.7568
9353,4/4/2005,11:00:00,2.4,1163,-200,11.4,1027,353,604,179,1264,1269,24.3,23.7,0.7119
9354,4/4/2005,12:00:00,2.4,1142,-200,12.4,1063,293,603,175,1241,1092,26.9,18.3,0.6406
9355,4/4/2005,13:00:00,2.1,1003,-200,9.5,961,235,702,156,1041,770,28.3,13.5,0.5139


Next we delete every observation where `C6H6(GT)` or `PT08.S1(CO)` equals $-200$, the value used to mark a missing measurement.

In [2]:
sum(airdata["C6H6(GT)"] == -200) # count number of observations to delete

366

In [3]:
# keep only the rows whose C6H6(GT) reading is not the missing-value marker
airdata = airdata[airdata["C6H6(GT)"] != -200].copy()

In [4]:
sum(airdata["C6H6(GT)"] == -200)

0

Notice that every observation with a missing `PT08.S1(CO)` value must also have had a missing `C6H6(GT)` value, since removing the latter left no $-200$ entries in `PT08.S1(CO)` either. A `len` call shows that $8991$ observations remain after the $366$ removals.

In [5]:
sum(airdata["PT08.S1(CO)"] == -200)

0

In [6]:
len(airdata)

8991
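The check on both columns can also be done in a single step. The following is a minimal sketch on a small made-up data frame (`toy`, `missing`, and `cleaned` are hypothetical names, and the values are not taken from the dataset); it flags a row when either column holds the $-200$ marker and keeps the rest.

```python
import pandas as pd

# made-up rows standing in for the two columns of interest
toy = pd.DataFrame({
    "C6H6(GT)": [11.9, -200.0, 9.0, -200.0],
    "PT08.S1(CO)": [1360, -200, 1402, -200],
})

# True for every row where at least one of the two columns equals -200
missing = (toy[["C6H6(GT)", "PT08.S1(CO)"]] == -200).any(axis=1)

# keep the complete rows and renumber them from 0
cleaned = toy.loc[~missing].reset_index(drop=True)

print(int(missing.sum()), len(cleaned))
```

Resetting the index is optional; it only matters if later code looks rows up by label rather than by position.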

# Grid Search

## Just `y`

We first create three functions:
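1. A loss function, which calculates and returns the root mean square error (RMSE) between all the observations and a fixed constant, `c`. As written, the function takes the square root of the sum of squared errors without dividing by the number of observations, so it returns $\sqrt{n}$ times the RMSE; the value of `c` that minimizes it is unchanged.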

In [7]:
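def loss(y = "C6H6(GT)", c = 0):

    """
    This function takes a column name from the airdata data frame and a fixed value c and returns the square root of the sum of squared differences between the observations in that column and c. Because the sum is not divided by the number of observations, the result is sqrt(n) times the root mean square error (RMSE).
    """

    SSE = 0 # initialize sum of squared errors (SSE)

    for i in range(len(airdata)):
        SSE += ( airdata[y].iloc[i] - c ) ** 2 # accumulate squared errors

    RMSE = math.sqrt(SSE) # root of the SSE; divide SSE by len(airdata) first for the true RMSE

    return RMSE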
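The row-by-row loop above can be replaced by a single column operation. The sketch below is one possible vectorized form, shown on made-up numbers; `root_sse`, `rmse`, and `toy` are hypothetical names introduced here. It also illustrates that the square root of the summed squared errors is $\sqrt{n}$ times the RMSE.

```python
import math
import pandas as pd

def root_sse(values: pd.Series, c: float) -> float:
    """Square root of the sum of squared differences between values and c."""
    return math.sqrt(((values - c) ** 2).sum())

def rmse(values: pd.Series, c: float) -> float:
    """Root mean square error: the squared differences are averaged first."""
    return math.sqrt(((values - c) ** 2).mean())

toy = pd.Series([1.0, 2.0, 3.0, 6.0])  # made-up observations

# the two quantities differ only by the factor sqrt(n)
print(root_sse(toy, 3.0), math.sqrt(len(toy)) * rmse(toy, 3.0))
```

Because the two functions differ by a constant positive factor, they are minimized by the same `c`.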

2. A function creating a uniform "grid" of `n` values that starts at a given lower bound `a` and advances in equal steps of size $(b - a)/n$ toward a given upper bound `b` (the upper bound itself is not included).

In [8]:
def makegrid(y = "C6H6(GT)", a = airdata["C6H6(GT)"].quantile(0.25),
             b = airdata["C6H6(GT)"].quantile(0.75), n = 1):

    """
    This function creates a list of n grid values starting at a and spaced (b - a)/n apart, so the last value is one step below b. By default a and b are the first and third quartiles of C6H6(GT). The argument y is accepted for consistency with gridsearch but is not used here.
    """

    mesh = (b - a)/n # step size
    grid = [x * mesh + a for x in range(n)]
    return grid
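For comparison, NumPy can produce the same kind of grid directly. This is a sketch assuming NumPy is available; `manual_grid` and `numpy_grid` are hypothetical names, and the bounds are made up rather than taken from the data.

```python
import numpy as np

a, b, n = 0.0, 1.0, 4  # made-up bounds and number of steps

# grid built the same way as in makegrid: a, a + mesh, ..., one step below b
mesh = (b - a) / n
manual_grid = [x * mesh + a for x in range(n)]

# endpoint=False leaves out b, matching the list comprehension above
numpy_grid = np.linspace(a, b, n, endpoint=False)

print(manual_grid)
print(numpy_grid)
```

Passing `endpoint=True` with `n + 1` points would include the upper bound as well, which may be preferable if the minimizer could sit at `b`.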

3. A function using the previous two to create and scour the grid, using those different values for `c` to find the smallest RMSE, returning both its value and the value for `c` where it occurs.

In [9]:
def gridsearch(y = "C6H6(GT)", a = airdata["C6H6(GT)"].quantile(0.25), \
               b = airdata["C6H6(GT)"].quantile(0.75), n = 1):
    """
    This function takes a column name from airdata, y, lower and upper bounds, a and b, at which to begin and end the search, and a desired number of steps, n. It creates a grid, runs the loss function on each grid value, and searches the resulting list for the minimum loss, returning the grid value where the minimum occurs together with that minimum.
    """

    if n > 1000: # Program takes approximately 1 minute to run with this cap
        print("Please choose a smaller number of iterations.")
        return

    grid = makegrid(y = y, a = a, b = b, n = n) # create a grid with given parameters

    rmsegrid = [[grid[i], loss(y = y, c = grid[i])] for \
                i in range(n)] # list comprehension to compute the root mean square error for all the values within the grid

    gridminrmse = 1e32 # initializes an error value to compare within list

    for i in range(n): # finds the smallest RMSE and its location in the grid
        if rmsegrid[i][1] < gridminrmse:
            gridminrmse = rmsegrid[i][1]
            gridref = rmsegrid[i][0]

    return [gridref, gridminrmse]

Running the function with $1000$ steps yields the following:

In [10]:
gridsearch(n = 1000)

[np.float64(10.0832), 706.3592030828498]

Note that the value of `c` minimizing the RMSE is the mean of the data. The mean, computed below, approximately matches the value found by the brute-force search.

In [11]:
airdata["C6H6(GT)"].mean()

np.float64(10.083105327549772)
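A short derivation suggests why the mean appears here. It is a sketch that assumes the loss is the RMSE of $n$ observations $y_1, \dots, y_n$; the symbols $L$, $S$, $n$, $y_i$, and $\bar{y}$ are notation introduced for this note.

$$
L(c) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - c)^2}
$$

Since the square root is increasing and $1/n$ is a positive constant, minimizing $L(c)$ is the same as minimizing $S(c) = \sum_{i=1}^{n}(y_i - c)^2$:

$$
\frac{dS}{dc} = -2\sum_{i=1}^{n}(y_i - c) = 0 \quad\Longrightarrow\quad c = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}, \qquad \frac{d^2S}{dc^2} = 2n > 0
$$

The same argument applies when the division by $n$ is omitted. The grid search can only return one of its grid points, which is why its answer agrees with $\bar{y}$ only approximately.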

Similarly, we verify such a value for `c` produces an RMSE similar to that found by the function.

In [12]:
loss(c = airdata["C6H6(GT)"].mean())

706.3592030258148

We repeat the process on the variable `PT08.S1(CO)` to verify that it works independently of the variable chosen.

In [13]:
gridsearch(y = "PT08.S1(CO)", a = airdata["PT08.S1(CO)"].quantile(0.25), \
           b = airdata["PT08.S1(CO)"].quantile(0.75), n = 1000)

[np.float64(1099.876), 20582.576666822246]

In [14]:
airdata["PT08.S1(CO)"].mean()

np.float64(1099.8331664998332)

In [15]:
loss(y = "PT08.S1(CO)", c = airdata["PT08.S1(CO)"].mean())

20582.576266098204

## Using `y` and another numeric variable `x`

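We now mimic the previous section, this time searching for the intercept `beta0` and slope `beta1` of a linear function, `beta0 + beta1 * x`, that minimizes the RMSE between the predicted and observed values of `y`.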

In [28]:
def loss2d(x = "PT08.S1(CO)", y = "C6H6(GT)", beta0 = 0, beta1 = 0):

    """
    This function takes two column names from the airdata data frame and two parameters for the linear regression, beta0 and beta1, and returns the square root of the sum of squared differences between the observed values of y and the predicted values beta0 + beta1 * x. Because the sum is not divided by the number of observations, the result is sqrt(n) times the root mean square error (RMSE).
    """

    SSE = 0 # initialize sum of squared errors (SSE)

    for i in range(len(airdata)):
        SSE += ( airdata[y].iloc[i] -
                beta0 - beta1 * airdata[x].iloc[i] ) ** 2 # accumulate squared errors

    RMSE = math.sqrt(SSE) # root of the SSE; divide SSE by len(airdata) first for the true RMSE

    return RMSE

In [29]:
def makegrid2d(a = -25, b = -15, n = 1, c = -5, d = 5, m = 1):

    """
    This function creates a 2d-grid (a list of lists) of [beta0, beta1] pairs. The beta0 values start at a and advance in n steps of size (b - a)/n; the beta1 values start at c and advance in m steps of size (d - c)/m. The upper endpoints b and d are not included.
    """

    meshb0 = (b - a)/n
    meshb1 = (d - c)/m
    grid2d = [[0] * m for _ in range(n)]

    for i in range(n):
        for j in range(m):
            grid2d[i][j] = [meshb0 * i + a, meshb1 * j + c]

    return grid2d

In [40]:
grid2d = makegrid2d(a = -25, b = -15, n = 10, c = -5, d = 5, m = 100) # this is temporary, used to check the code tested below

In [54]:
grid2d # checking the grid values

[[[-25.0, -5.0],
  [-25.0, -4.9],
  [-25.0, -4.8],
  [-25.0, -4.7],
  [-25.0, -4.6],
  [-25.0, -4.5],
  [-25.0, -4.4],
  [-25.0, -4.3],
  [-25.0, -4.2],
  [-25.0, -4.1],
  [-25.0, -4.0],
  [-25.0, -3.9],
  [-25.0, -3.8],
  [-25.0, -3.7],
  [-25.0, -3.5999999999999996],
  [-25.0, -3.5],
  [-25.0, -3.4],
  [-25.0, -3.3],
  [-25.0, -3.2],
  [-25.0, -3.0999999999999996],
  [-25.0, -3.0],
  [-25.0, -2.9],
  [-25.0, -2.8],
  [-25.0, -2.6999999999999997],
  [-25.0, -2.5999999999999996],
  [-25.0, -2.5],
  [-25.0, -2.4],
  [-25.0, -2.3],
  [-25.0, -2.1999999999999997],
  [-25.0, -2.0999999999999996],
  [-25.0, -2.0],
  [-25.0, -1.9],
  [-25.0, -1.7999999999999998],
  [-25.0, -1.6999999999999997],
  [-25.0, -1.5999999999999996],
  [-25.0, -1.5],
  [-25.0, -1.4],
  [-25.0, -1.2999999999999998],
  [-25.0, -1.1999999999999997],
  [-25.0, -1.0999999999999996],
  [-25.0, -1.0],
  [-25.0, -0.8999999999999995],
  [-25.0, -0.7999999999999998],
  [-25.0, -0.7000000000000002],
  [-25.0, -0.59999999999999

In [32]:
def grid2dsearch(x = "PT08.S1(CO)", y = "C6H6(GT)",
                 a = -25, b = -15, n = 1,
                 c = -5, d = 5, m = 1):
    """
    This function takes two column names from airdata, x and y, lower and upper bounds for beta0 (a and b) with a number of steps n, and lower and upper bounds for beta1 (c and d) with a number of steps m. It creates a 2d-grid, runs the loss2d function on each [beta0, beta1] pair, and returns the pair with the smallest loss together with that loss.
    """

    grid2d = makegrid2d(a = a, b = b, n = n, c = c, d = d, m = m) # create a 2dgrid with given parameters

    minb0 = grid2d[0][0][0] # initialize minimum RMSE beta-0
    minb1 = grid2d[0][0][1] # initialize minimum RMSE beta-1
    minrmse = loss2d(x = x, y = y, beta0 = minb0, beta1 = minb1) # initialize minimum RMSE value

    for i in range(n):
        for j in range(m):
            rmse = loss2d(x = x, y = y,
                          beta0 = grid2d[i][j][0],
                          beta1 = grid2d[i][j][1]) # evaluate the loss once per grid point
            if rmse < minrmse: # if the loss is smaller, update the current smallest RMSE and its beta parameters
                minrmse = rmse
                minb0 = grid2d[i][j][0]
                minb1 = grid2d[i][j][1]

    return [minb0, minb1, minrmse]
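Most of the running time of a search like this comes from the Python-level loops over rows and grid points. The sketch below shows one possible vectorized alternative using NumPy broadcasting. It is an illustration on made-up data, not the notebook's method; `grid_search_2d`, `b0_grid`, and `b1_grid` are hypothetical names, and the loss used is the true RMSE (with division by the number of observations).

```python
import numpy as np

def grid_search_2d(x, y, b0_values, b1_values):
    """Return (beta0, beta1, rmse) for the grid point with the smallest RMSE."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # resid[i, j, k] = y[k] - b0_values[i] - b1_values[j] * x[k]
    resid = (y[None, None, :]
             - b0_values[:, None, None]
             - b1_values[None, :, None] * x[None, None, :])
    rmse = np.sqrt((resid ** 2).mean(axis=2))  # one RMSE per (beta0, beta1) pair
    i, j = np.unravel_index(np.argmin(rmse), rmse.shape)
    return b0_values[i], b1_values[j], rmse[i, j]

# made-up data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

b0_grid = np.linspace(0.0, 2.0, 21)  # candidate intercepts, step 0.1
b1_grid = np.linspace(1.0, 3.0, 21)  # candidate slopes, step 0.1

b0, b1, err = grid_search_2d(x, y, b0_grid, b1_grid)
print(b0, b1, err)
```

The intermediate array holds one residual per grid point per observation, so its size grows with `n * m * len(x)`; very fine grids may need to be processed in chunks.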

In [31]:
grid2dsearch(n = 10, m = 100) # this code takes approximately 2.5 minutes to run

[-16.0, 0.0, 2572.1150479712214]

In [52]:
grid2d[9][50] # checking the index of the claimed minimized RMSE value

[-16.0, 0.0]

In [20]:
loss2d(beta0 = -16, beta1 = 0)

2572.1150479712214

In [21]:
y = airdata["C6H6(GT)"]
x = airdata["PT08.S1(CO)"]

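# closed-form least squares estimates of the slope and intercept, for comparison with the grid search
b1hat = sum((x - x.mean())*(y - y.mean()))/sum((x - x.mean())**2)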
b0hat = y.mean() - x.mean()*b1hat
print(b0hat, b1hat)

loss2d(beta0 = b0hat, beta1 = b1hat)

-23.275221899724677 0.03033035213280187


330.48724365992194

# Gradient Descent

## Just `y`

We may reuse the loss function from the grid search section. However, we need a new function that computes a difference quotient approximation to its derivative with respect to `c`.

In [22]:
def diffquo(y = "C6H6(GT)", c = 0, delta = 1):

    """
    Calculates the forward difference quotient of the loss function at c with step delta, which approximates the derivative of the loss with respect to c.
    """

    dq = (loss(y = y, c = c + delta) - loss(y = y, c = c)) / delta

    return dq
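The quality of a difference quotient depends on `delta`. The sketch below compares a forward difference quotient with the exact derivative of a true RMSE on made-up data; `rmse`, `forward_quotient`, `exact_slope`, and `values` are hypothetical names, and the exact slope formula $(c - \bar{y})/\mathrm{RMSE}(c)$ follows from differentiating the RMSE with respect to $c$.

```python
import math

def rmse(values, c):
    """Root mean square error between a list of values and a constant c."""
    return math.sqrt(sum((v - c) ** 2 for v in values) / len(values))

def forward_quotient(values, c, delta):
    """Forward difference quotient of rmse at c with step delta."""
    return (rmse(values, c + delta) - rmse(values, c)) / delta

def exact_slope(values, c):
    """Exact derivative of rmse with respect to c: (c - mean) / rmse."""
    mean = sum(values) / len(values)
    return (c - mean) / rmse(values, c)

values = [2.0, 4.0, 4.0, 6.0, 9.0]  # made-up data with mean 5.0

for delta in (1.0, 0.1, 0.001):
    gap = abs(forward_quotient(values, 0.0, delta) - exact_slope(values, 0.0))
    print(delta, gap)  # the gap shrinks as delta shrinks
```

A default of `delta = 1` is therefore a fairly coarse approximation; a smaller value is usually a safer default, as long as it is not so small that rounding error dominates.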

In [23]:
y.mean()

np.float64(10.083105327549772)

In [24]:
loss(c = -93.629)

9859.408939679419

In [25]:
c = y.mean()
print(c)

10.083105327549772


In [64]:
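# this code was run repeatedly until the values converged to the following (incorrect) result
# the result is incorrect because c is overwritten below with the difference quotient at nextc (using the default delta = 1) instead of with nextc itself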

nextc = c - diffquo(c = c, delta = 0.001) * 0.01
print(nextc)

c = diffquo(c = nextc)
print(c)

print(abs(nextc - c))

-93.6290464572703
-94.57486236291516
0.9458159056448494
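The values above settle near $-94$ rather than near the mean because `c` is replaced by a slope instead of by the updated guess. The following is a minimal sketch of how the update could be written so that the guess itself is carried forward. It runs on made-up data with a true RMSE; `rmse`, `diff_quotient`, `gradient_descent`, and the parameters `step`, `tol`, and `max_iter` are hypothetical names and values chosen for illustration.

```python
import math

def rmse(values, c):
    """Root mean square error between a list of values and a constant c."""
    return math.sqrt(sum((v - c) ** 2 for v in values) / len(values))

def diff_quotient(values, c, delta=1e-6):
    """Forward difference approximation to the derivative of rmse at c."""
    return (rmse(values, c + delta) - rmse(values, c)) / delta

def gradient_descent(values, c0=0.0, step=0.5, tol=1e-8, max_iter=10_000):
    """Repeat c <- c - step * slope until successive guesses differ by less than tol."""
    c = c0
    for _ in range(max_iter):
        next_c = c - step * diff_quotient(values, c)
        if abs(next_c - c) < tol:
            return next_c
        c = next_c  # carry the updated guess forward, not the slope
    return c

values = [2.0, 4.0, 4.0, 6.0, 9.0]  # made-up data with mean 5.0
c_hat = gradient_descent(values)
print(c_hat, sum(values) / len(values))
```

The stopping rule compares successive guesses, so both quantities being compared are values of `c`; the step size and tolerance would need tuning for data on a different scale.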


## Using `y` and another numeric variable `x`