# Project 1: Air Quality with Sequential CV
Task #3
Jacob A. Fericy

This notebook predicts **benzene concentration** `C6H6(GT)` from the **UCI Air Quality Dataset** (UCI ML Repo id=360) and builds a sequential cross-validation procedure to compare two linear models:

- **_Time-only_** linear model: Day
- **_Full_** linear model: Day + CO(GT) + T + RH + AH

We evaluate with walk‑forward (one‑step‑ahead) validation: we fit on days 1, 2, ..., d, predict day d + 1, compute the MSE of that prediction, and sum the MSE values over the evaluation window.
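
In symbols, the summed score can be sketched as follows. The notation is introduced here for illustration only: $\hat{f}_{1:d}$ is the linear model fitted on days 1 through $d$, $x_{d+1}$ and $y_{d+1}$ are the predictors and the observed benzene value on day $d+1$, $d_0$ is the first training cutoff, and $D$ is the last available day.

$$
\text{SumMSE} = \sum_{d=d_0}^{D-1} \left( y_{d+1} - \hat{f}_{1:d}(x_{d+1}) \right)^2
$$

Because each test fold holds a single daily observation, the per-fold MSE reduces to one squared error.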

## 1) Imports & Functions

**oneStepMSE** trains on all rows up to a given day, predicts the following day, and returns the MSE of that prediction.  
**sequentialCVMSE** loops forward from a start day and sums the one‑step MSE values.


In [15]:
import numpy as np
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def oneStepMSE(X, y, day, day_col="Day"):
    """Fit on rows with day_col <= day; return the MSE on day + 1, or None if that day is missing."""
    #separate training rows from the next-day test row
    train_mask = X[day_col] <= day
    test_mask = X[day_col] == (day + 1)
    X_train, y_train = X.loc[train_mask], y.loc[train_mask]
    X_test, y_test = X.loc[test_mask], y.loc[test_mask]

    #ensure next day obs is there
    if len(X_test) != 1:
        print("Exception: Next-Day observation not present!")
        return None

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    return mean_squared_error(y_test, y_pred)

def sequentialCVMSE(X, y, start_day=250, day_col="Day"):

    max_day = int(X[day_col].max())

    mse_sum = 0.0
    
    #iterate through days, summing the one-step-ahead MSE until an evaluation fails
    for d in range(start_day, max_day):
        mse = oneStepMSE(X, y, day=d, day_col=day_col)
        if mse is None:
            #condition kick-out if there is nothing to add
            break
        mse_sum += mse 
    return mse_sum

def fitEvaluateModel(X_train, X_test, y_train, y_test):
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    
    return model, mse
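
To make the walk-forward loop concrete, here is a minimal sketch on synthetic data. It is not the notebook's implementation: the helper `walk_forward_sse` is hypothetical, it indexes by array position rather than by a `Day` column, and it uses `np.polyfit` with degree 1 as a stand-in for `LinearRegression`.

```python
import numpy as np

def walk_forward_sse(x, y, start):
    """Sum of one-step-ahead squared errors for a straight-line fit.

    For each position d from start to len(y) - 1, fit on points 0..d-1
    and predict point d. np.polyfit (degree 1) stands in for LinearRegression.
    """
    total = 0.0
    for d in range(start, len(y)):
        slope, intercept = np.polyfit(x[:d], y[:d], deg=1)
        pred = slope * x[d] + intercept
        total += (y[d] - pred) ** 2
    return total

days = np.arange(1, 21, dtype=float)
linear = 2.0 * days + 1.0      # exactly linear series
shifted = linear.copy()
shifted[15:] += 5.0            # level shift starting at day 16

sse_linear = walk_forward_sse(days, linear, start=10)
sse_shifted = walk_forward_sse(days, shifted, start=10)
print(sse_linear, sse_shifted)
```

On the exactly linear series every forecast is correct up to floating-point error, so the sum is essentially zero. The level shift is missed by the model trained only on earlier days, which is the kind of error a walk-forward score is meant to expose.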

## 2) Load and Clean Data

We fetch the dataset and keep only the columns we need for this review. 

In this dataset, the value **-200** is a missing-value sentinel that needs to be filtered out.  
We filter each variable in turn, so a row is kept only if none of the five modeling columns contains the sentinel.

Then we parse Date on the filtered frame and drop any rows where the date still can’t be parsed.
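
As an illustration of the sentinel filter described above, the sketch below compares the sequential per-column filter with a one-pass form on a small made-up frame. The values in `toy` are invented for the example; the two forms agree only when the data has no genuine NaN values before the replacement.

```python
import numpy as np
import pandas as pd

# Made-up rows; -200 marks a missing reading
toy = pd.DataFrame({
    "C6H6(GT)": [11.9, -200.0, 9.0, 9.2],
    "CO(GT)":   [2.6, 2.0, -200.0, 2.2],
    "T":        [13.6, 13.3, 11.9, 11.0],
})
cols = ["C6H6(GT)", "CO(GT)", "T"]

# Sequential filter: drop rows where any listed column equals -200
seq = toy.copy()
for c in cols:
    seq = seq[seq[c] != -200]

# One-pass form: mark the sentinel as NaN, then drop incomplete rows
one_pass = toy.replace(-200, np.nan).dropna(subset=cols)

print(len(toy), len(seq), len(one_pass))
```

Both approaches keep the same rows here. The NaN form has the side benefit that the sentinel can never leak into a mean if a later step forgets to filter.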



In [16]:
air_quality = fetch_ucirepo(id=360)
df = air_quality.data.features.copy()

df_clean = df.copy()
vars_to_check = ["C6H6(GT)", "CO(GT)", "T", "RH", "AH"]

#cleans dataset
for v in vars_to_check:
    df_clean = df_clean[df_clean[v] != -200]
    
#parse Date on the filtered frame; unparseable values become NaT and are dropped
df_clean["Date"] = pd.to_datetime(df_clean["Date"], errors="coerce")
df_clean = df_clean.dropna(subset=["Date"])
df_clean = df_clean.sort_values("Date", ascending=True).reset_index(drop=True)




## 3) Aggregate Day Data

The raw data is higher-frequency (hourly).  
We aggregate to **daily means** and create a sequential day index used for time ordering.


In [20]:
#aggregates variables to daily means
df_day = (
    df_clean
    .groupby("Date", as_index=False)[vars_to_check]
    .mean()
    .sort_values("Date")
    .reset_index(drop=True)
)

df_day["Day"] = np.arange(1, len(df_day) + 1)

df = df_day.copy()
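
One consequence of numbering only the retained days: if cleaning removes every hourly row for some date, consecutive `Day` values are no longer consecutive calendar days. The sketch below shows one way to check for such gaps. The frame `toy_day` and the column `gap_days` are hypothetical and the values are invented.

```python
import numpy as np
import pandas as pd

# Invented daily frame with one missing calendar day (2004-03-13)
toy_day = pd.DataFrame({
    "Date": pd.to_datetime(["2004-03-10", "2004-03-11", "2004-03-12", "2004-03-14"]),
    "C6H6(GT)": [8.5, 7.9, 12.1, 9.6],
})
toy_day["Day"] = np.arange(1, len(toy_day) + 1)

# Calendar days elapsed since the previous retained row (NaN for the first row)
toy_day["gap_days"] = toy_day["Date"].diff().dt.days

# Rows where Day advanced by 1 but the calendar advanced by more
gaps = toy_day[toy_day["gap_days"] > 1]
print(gaps[["Date", "Day", "gap_days"]])
```

Running the same check on `df_day` would show whether "predict day d + 1" always means the next calendar day in this dataset.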

## 4) Build features and run walk-forward validation

We compare:
- X_time = [Day]
- X_full = [Day, CO(GT), T, RH, AH]

Then compute summed one-step-ahead MSE starting at **start_day = 250**.

Condition: if there are fewer than 252 daily rows after cleaning, we automatically shift start_day to half the series length so the code still runs.
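
The fallback rule can be sketched as a small helper. The names `choose_start_day` and `n_folds` are hypothetical and do not appear in the notebook; the fold count assumes one row per day and a walk-forward loop that never stops early.

```python
def choose_start_day(n_days, preferred=250):
    """Return the first training cutoff for walk-forward validation.

    Keeps `preferred` when the series has more than preferred + 1 rows;
    otherwise falls back to half the series length (at least 1).
    """
    if n_days <= preferred + 1:
        return max(1, n_days // 2)
    return preferred

def n_folds(n_days, start_day):
    # The loop runs d = start_day, ..., n_days - 1, one forecast per d
    return max(0, n_days - start_day)

for n in (347, 252, 251, 100):
    s = choose_start_day(n)
    print(n, s, n_folds(n, s))
```

With 347 daily rows the preferred cutoff of 250 is kept, which leaves 97 one-step forecasts.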


In [18]:
X_time = df_day[["Day"]]
X_full = df_day[["Day", "CO(GT)", "T", "RH", "AH"]]
y = df_day["C6H6(GT)"]

#ensures starting day allows one split given the data we are pulling from upstream
start_day = 250
if len(df_day) <= start_day + 1:
    start_day = max(1, len(df_day) // 2)

mse_sum_time = sequentialCVMSE(X_time, y, start_day = start_day, day_col = "Day")
mse_sum_full = sequentialCVMSE(X_full, y, start_day = start_day, day_col = "Day")

print("Daily rows:", len(df_day), "| start_day used:", start_day)
print("Overall SUM MSE (time only):", mse_sum_time)
print("Overall SUM MSE (full predictors):", mse_sum_full)


Daily rows: 347 | start_day used: 250
Overall SUM MSE (time only): 1820.9283445426604
Overall SUM MSE (full predictors): 496.46303567275567
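
A summed MSE grows with the number of forecasts, so it can help to convert it to a per-forecast average. The sketch below does that arithmetic on the printed values. It assumes the loop produced one forecast for every cutoff from `start_day` to the second-to-last day, which is consistent with no "Next-Day observation not present" message appearing in the output.

```python
# Values copied from the printed output
n_days, start_day = 347, 250
sum_mse_time = 1820.9283445426604
sum_mse_full = 496.46303567275567

# One forecast per cutoff d = start_day, ..., n_days - 1
n_forecasts = n_days - start_day

mean_mse_time = sum_mse_time / n_forecasts
mean_mse_full = sum_mse_full / n_forecasts
ratio = sum_mse_time / sum_mse_full

print(f"forecasts: {n_forecasts}")
print(f"mean one-step MSE (time only): {mean_mse_time:.2f}")
print(f"mean one-step MSE (full):      {mean_mse_full:.2f}")
print(f"time-only / full ratio:        {ratio:.2f}")
```

The ratio is the same whether computed on sums or means, since both models are scored on the same forecasts.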


## 5) Model Fit

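We do the following:
- Fit a simple linear regression (SLR) of benzene on CO(GT) alone
- Fit a multiple linear regression (MLR) on CO(GT), T, RH, and AH
- Compare test MSE on a random 75/25 split, then refit the better model on the full dataset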

In [19]:
target_col = "C6H6(GT)"
predictor_cols = ["CO(GT)", "T", "RH", "AH"]
needed_cols = ["Date", target_col] + predictor_cols

X = df[predictor_cols]
y = df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 451)

X_train_slr = X_train[["CO(GT)"]]
X_test_slr = X_test[["CO(GT)"]]

slr_model, slr_mse = fitEvaluateModel(
    X_train_slr, X_test_slr, y_train, y_test
)

print(f"SLR Test MSE: {slr_mse:.3f}")

mlr_model, mlr_mse = fitEvaluateModel(
    X_train, X_test, y_train, y_test
)

print(f"MLR Test MSE: {mlr_mse:.3f}")

if mlr_mse < slr_mse:
    print("MLR performs better based on MSE.")
else:
    print("SLR performs better based on MSE.")

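#refit whichever model had the lower test MSE on the full dataset
best_cols = predictor_cols if mlr_mse < slr_mse else ["CO(GT)"]
best_model = LinearRegression()
best_model.fit(X[best_cols], y)

print("\nBest model fitted on full dataset.")
print("Coefficients:")
for col, coef in zip(best_cols, best_model.coef_):
    print(f"  {col}: {coef:.3f}")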

print(f"Intercept: {best_model.intercept_:.3f}")


SLR Test MSE: 4.382
MLR Test MSE: 3.226
MLR performs better based on MSE.

Best model fitted on full dataset.
Coefficients:
  CO(GT): 4.771
  T: 0.120
  RH: -0.016
  AH: 0.689
Intercept: -1.838
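
The split in this section is random, so some test days precede some training days. For time-ordered data, one alternative is to hold out the most recent rows instead. The sketch below is an assumption-laden illustration, not part of the original analysis: `chronological_split` is a hypothetical helper and it expects rows already sorted by date.

```python
import numpy as np

def chronological_split(n_rows, test_size=0.25):
    """Return (train_idx, test_idx) with the most recent rows held out."""
    n_test = int(np.ceil(n_rows * test_size))
    split = n_rows - n_test
    return np.arange(split), np.arange(split, n_rows)

train_idx, test_idx = chronological_split(347, test_size=0.25)
print(len(train_idx), len(test_idx), train_idx.max() < test_idx.min())
```

The index arrays could then be applied with `X.iloc[train_idx]` and `X.iloc[test_idx]`. Passing `shuffle=False` to `train_test_split` is another way to get an ordered split.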


## 6) Conclusions

Above we extended the model from a single predictor, CO(GT), to multiple predictors: CO(GT), temperature (T), relative humidity (RH), and absolute humidity (AH). From a domain perspective, this makes sense because benzene concentration is affected not only by emissions but also by atmospheric conditions that influence dispersion and chemical behavior, so several potential drivers are worth including.

Because MLR uses more relevant information, we would expect its Mean Squared Error (MSE) to be lower than that of SLR. We compare the models on MSE because it penalizes larger prediction errors more heavily, which makes it a suitable criterion for regression problems where large deviations are costly.
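
A small worked example shows how MSE weights large errors. The two error lists are invented and have the same mean absolute error; the helper names `mse` and `mae` are defined here only for the illustration.

```python
def mse(errors):
    # Mean of squared errors
    return sum(e ** 2 for e in errors) / len(errors)

def mae(errors):
    # Mean of absolute errors
    return sum(abs(e) for e in errors) / len(errors)

even = [2.0, 2.0, 2.0, 2.0]     # four moderate misses
spiky = [0.0, 0.0, 0.0, 8.0]    # one large miss, same total absolute error

print("MAE:", mae(even), mae(spiky))
print("MSE:", mse(even), mse(spiky))
```

Both lists have a mean absolute error of 2.0, but the MSE is 4.0 for the even errors and 16.0 for the single large miss.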

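The results bear this out: MLR has the lower test MSE (3.226 versus 4.382 for SLR). The walk-forward comparison in Section 4 points the same way, since adding the pollutant and weather predictors to Day reduced the summed one-step MSE from about 1820.9 to about 496.5. Having identified MLR as the preferred model, we refit it on the full dataset so that the parameter estimates use every available observation. Of the models compared here, this final model is the one we would choose for interpretation or future prediction tasks on this dataset.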