# Multiple Regression using Weather Data to Predict Air Quality 

In this report, we analyze the impact of our weather predictors (wind, temperature, pressure, and humidity) on pollutant levels by building and evaluating a multiple regression model. We report R-Squared and Adjusted R-Squared values to measure how much of the variation in the dependent variables is explained by the independent variables, and to help choose a minimal set of weather predictors so that the model does not overfit.
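As a reminder of what the R-Squared statistic measures, the sketch below computes it by hand for a single target as one minus the ratio of residual variation to total variation. The observed and predicted values are made up for illustration and are not taken from the Houston data set.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed and predicted values for one pollutant (illustration only).
y_true = np.array([10.0, 12.0, 15.0, 11.0, 14.0])
y_pred = np.array([11.0, 12.5, 13.5, 11.5, 13.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print("R-squared by hand:", r_squared)
print("R-squared from scikit-learn:", r2_score(y_true, y_pred))
```

A value near 1 means the predictions track the observations closely, while a value near 0 means the model does little better than always predicting the mean.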

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
archive_path = '/home/UNT/tjt0147/csce5310/'
file_path = os.path.join(archive_path, 'houston-aqi-2010-2021.csv')
df = pd.read_csv(file_path)
print(df.head())

   Unnamed: 0  day_of_year  year   latitude  longitude  avg_pm10  aqi_pm10  \
0           0            2  2010  29.733726 -95.257593      13.0        12   
1           1            2  2010  29.733726 -95.257593      13.0        12   
2           2            2  2010  29.733726 -95.257593      13.0        12   
3           3            2  2010  29.733726 -95.257593      13.0        12   
4           4            2  2010  29.733726 -95.257593      13.0        12   

     avg_co  aqi_co    avg_no2  ...    avg_o3  aqi_o3  avg_pm25  aqi_pm25  \
0  0.297667     NaN  17.258333  ...  0.027294      32      11.6      48.0   
1  0.297667     NaN  17.258333  ...  0.027294      32      11.6      48.0   
2  0.297667     NaN  17.258333  ...  0.027294      32       9.7      40.0   
3  0.297667     NaN  17.258333  ...  0.027294      32       9.7      40.0   
4  0.325000     6.0  17.258333  ...  0.027294      32      11.6      48.0   

    avg_so2  aqi_so2  avg_humidity  avg_temperature  avg_wind  avg_p
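The preview above shows a `NaN` in `aqi_co` and several rows that look repeated in the visible columns. Since `LinearRegression` does not accept missing values, it may be worth checking both before fitting. The sketch below uses a small made-up frame (the `avg_wind` values are hypothetical) rather than the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics a few columns from the preview.
sample = pd.DataFrame({
    "day_of_year": [2, 2, 2, 3],
    "year": [2010, 2010, 2010, 2010],
    "avg_co": [0.297667, 0.297667, 0.325, np.nan],
    "avg_wind": [5.1, 5.1, 5.1, 4.2],  # made-up values
})

missing_per_column = sample.isna().sum()          # count of NaN per column
duplicate_rows = int(sample.duplicated().sum())   # rows identical to an earlier row
clean = sample.dropna().drop_duplicates()         # one possible cleaning choice

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Rows kept:", len(clean))
```

Whether dropping duplicates is appropriate depends on how the file was assembled; repeated rows can be legitimate if they come from different monitors.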

## Adjusted R-Squared and MSE for three predictors

Below, we fit a multiple regression for each of the four possible combinations of three predictors, to determine whether three independent variables are sufficient for our regression equation. Using fewer predictors helps keep the model from overfitting.
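The four three-predictor subsets are exactly the combinations of four weather variables taken three at a time, so they could also be generated instead of typed by hand. This is a minimal sketch; the name `weather_predictors` is introduced here for illustration, and the generated order differs from the hand-written list even though the subsets are the same.

```python
from itertools import combinations

# The four weather columns used in this report.
weather_predictors = ['avg_pressure', 'avg_wind', 'avg_temperature', 'avg_humidity']

# Every way of choosing three of the four predictors.
three_predictor_sets = [list(subset) for subset in combinations(weather_predictors, 3)]

for subset in three_predictor_sets:
    print(subset)
```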

In [3]:
independent_variables = [
    ['avg_pressure', 'avg_wind', 'avg_temperature'],
    ['avg_wind', 'avg_temperature', 'avg_humidity'],
    ['avg_temperature', 'avg_humidity', 'avg_pressure'],
    ['avg_humidity', 'avg_pressure', 'avg_wind']
]

dependent_variables = ['avg_pm10', 'avg_co', 'avg_no2', 'avg_o3', 'avg_pm25', 'avg_so2']

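def get_r_squared(X, Y):
    # Uses the global `model` fitted in the loop below. With several targets,
    # score() returns the uniform average of the per-target R-squared values.
    return model.score(X, Y)

def get_adjusted_r_squared(r_squared, n, k):
    # n = number of samples, k = number of predictors
    return 1 - ((1 - r_squared) * (n - 1) / (n - k - 1))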

for predictors in independent_variables:
    model = LinearRegression()
    X, Y = df[predictors], df[dependent_variables]
    scaler = MinMaxScaler()
    X_normalized = scaler.fit_transform(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X_normalized, Y, test_size=0.2, random_state=42)
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    mse = mean_squared_error(Y_test, Y_pred)
    print(f"Predictors: {predictors}")
    print("Coefficients:", model.coef_)
    print("Intercept:", model.intercept_)
    print("Mean Squared Error on Test Set:", mse)
    r_squared = get_r_squared(X_test, Y_test)
    print(f"Regular R-squared: {r_squared}")
    print(f"Adjusted R-squared: {get_adjusted_r_squared(r_squared, *X_test.shape)}") 
    print()

Predictors: ['avg_pressure', 'avg_wind', 'avg_temperature']
Coefficients: [[ 7.59704332e+00  2.18856323e+01  3.64478399e+01]
 [-7.76873737e-02 -2.09062065e-01 -1.62389496e-01]
 [-5.38101699e+00 -1.46397911e+01 -9.27348340e+00]
 [ 6.68846136e-04 -8.14343713e-03  1.14641063e-02]
 [-2.41200428e-01 -3.78860778e+00  6.92681105e+00]
 [-1.85964623e-01 -6.70266056e-01 -7.97675897e-01]]
Intercept: [3.54649203e+00 4.35023658e-01 2.63750476e+01 2.03128838e-02
 8.64687254e+00 1.45622799e+00]
Mean Squared Error on Test Set: 48.841781714338055
Regular R-squared: 0.16154761357269626
Adjusted R-squared: 0.15887738304267307

Predictors: ['avg_wind', 'avg_temperature', 'avg_humidity']
Coefficients: [[ 2.45883356e+01  3.58701121e+01 -2.38270382e+01]
 [-1.90532084e-01 -1.20410918e-01 -1.97625911e-02]
 [-1.26197440e+01 -5.79035927e+00 -5.57146087e+00]
 [-5.12413263e-03  1.35863062e-02 -1.79672200e-02]
 [-3.84068930e+00  6.97150446e+00  5.64054186e-01]
 [-4.87049188e-01 -5.88698196e-01 -8.39598218e-01]]
Int
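The cell above fits `MinMaxScaler` on the full data set before splitting, so the scaler sees the test rows. For ordinary least squares with an intercept this rescaling does not change the predictions, only the scale of the coefficients, but fitting the scaler on the training rows alone is the safer habit. The sketch below shows one way to do that with a scikit-learn pipeline, using synthetic data in place of the Houston file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 3 predictors, 2 targets (not the real measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ rng.normal(size=(3, 2)) + rng.normal(scale=0.5, size=(200, 2))

# Split first, so the test rows never influence the scaler.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then the regression.
pipeline = make_pipeline(MinMaxScaler(), LinearRegression())
pipeline.fit(X_train, Y_train)

test_r_squared = pipeline.score(X_test, Y_test)
print("Test R-squared:", test_r_squared)
```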

## Using all predictors

We compute the same statistics (Mean Squared Error and Adjusted R-Squared) using all four predictors, to determine whether the factors combined (wind, temperature, pressure, and humidity) are useful for our multiple regression model and lead to better overall air quality predictions.

In [4]:
predictors = ['avg_pressure', 'avg_wind', 'avg_temperature', 'avg_humidity']

model = LinearRegression()
X, Y = df[predictors], df[dependent_variables]
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
X_train, X_test, Y_train, Y_test = train_test_split(X_normalized, Y, test_size=0.2, random_state=42)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
mse = mean_squared_error(Y_test, Y_pred)
print(f"Predictors: {predictors}")
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error on Test Set:", mse)
r_squared = get_r_squared(X_test, Y_test)
print(f"Regular R-squared: {r_squared}")
print(f"Adjusted R-squared: {get_adjusted_r_squared(r_squared, *X_test.shape)}") 

Predictors: ['avg_pressure', 'avg_wind', 'avg_temperature', 'avg_humidity']
Coefficients: [[-5.88236352e-01  2.44883100e+01  3.55837280e+01 -2.39072255e+01]
 [-8.85882341e-02 -2.05595912e-01 -1.63540290e-01 -3.18387812e-02]
 [-7.64538281e+00 -1.39197895e+01 -9.51253027e+00 -6.61366582e+00]
 [-5.75113130e-03 -6.10207350e-03  1.07863556e-02 -1.87512040e-02]
 [-5.04352797e-02 -3.84926548e+00  6.94694994e+00  5.57178938e-01]
 [-4.96601046e-01 -5.71492828e-01 -8.30469479e-01 -9.07293987e-01]]
Intercept: [20.95873347  0.45821265 31.19194895  0.03396986  8.24106492  2.11703332]
Mean Squared Error on Test Set: 45.79309124222485
Regular R-squared: 0.19550379452067856
Adjusted R-squared: 0.1920840444442521
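The Mean Squared Error and R-Squared printed above are each a single number averaged uniformly over all six pollutants. Because the pollutants are measured on very different scales, the averaged MSE is dominated by the largest-scale targets. The sketch below shows how per-target values could be obtained with `multioutput="raw_values"`; the two columns are made-up stand-ins for a large-scale and a small-scale pollutant.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical values: column 0 on a PM10-like scale, column 1 on a CO-like scale.
Y_true = np.array([[20.0, 0.30], [35.0, 0.25], [28.0, 0.40], [40.0, 0.35]])
Y_pred = np.array([[25.0, 0.31], [30.0, 0.27], [30.0, 0.38], [36.0, 0.33]])

# One value per target instead of a single average.
mse_per_target = mean_squared_error(Y_true, Y_pred, multioutput="raw_values")
r2_per_target = r2_score(Y_true, Y_pred, multioutput="raw_values")

# The default behaviour: a uniform average across targets.
mse_average = mean_squared_error(Y_true, Y_pred)

print("MSE per target:", mse_per_target)
print("R-squared per target:", r2_per_target)
print("Averaged MSE:", mse_average)
```

Reporting per-pollutant scores would show which pollutants the weather variables explain best, which the averaged figures cannot reveal.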


## Summary

The low Adjusted R-Squared values for every model (at most 0.192) suggest that the weather variables explain only a small share of the variation in the pollutant measurements. Since the Adjusted R-Squared value increases when the fourth predictor is added, despite the penalty for the extra variable, each weather variable appears to contribute some predictive value. However, the low overall scores indicate that weather information alone is not sufficient for obtaining quality air quality predictions from our regression model.

Our best model appears to be the one that uses all four weather predictors (wind, temperature, pressure, and humidity). Of the four models using only three predictors, wind, temperature, and humidity give the best results, with the lowest Mean Squared Error and the highest Adjusted R-Squared value. Its Adjusted R-Squared value is only .001 lower than that of the four-predictor model, suggesting that pressure may be the least important of the weather variables when predicting air quality.
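The Adjusted R-Squared values reported above can be checked by hand from the printed R-Squared values. The sketch below assumes a test set of 946 rows. That figure is not stated in the report; it is inferred here because it reproduces the printed Adjusted R-Squared values, so treat it as an assumption.

```python
def get_adjusted_r_squared(r_squared, n, k):
    # n = number of samples, k = number of predictors
    return 1 - ((1 - r_squared) * (n - 1) / (n - k - 1))

n_test = 946  # assumption: inferred from the printed values, not stated in the report

# R-squared values copied from the printed outputs, with their predictor counts.
reported = {
    "pressure, wind, temperature": (0.16154761357269626, 3),
    "all four predictors": (0.19550379452067856, 4),
}

for label, (r_squared, k) in reported.items():
    adjusted = get_adjusted_r_squared(r_squared, n_test, k)
    penalty = r_squared - adjusted
    print(f"{label}: R^2={r_squared:.4f}, adjusted={adjusted:.4f}, penalty={penalty:.4f}")
```

With a test set of this size the adjustment amounts to only a few thousandths, so the ranking of the models is driven almost entirely by the unadjusted R-Squared.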