# Potsdam PM2.5 Support Vector Regression (SVR) Forecasting

* Data collected by the DEBB021 station between 2013 and 2023 is used.
* To improve the accuracy of the PM2.5 estimates, NO2, O3, SO2, and PM10 pollutant data accepted by the EEA was added.


In [1]:
# imports
import os
import sys

# Make the parent directory importable so the project modules can be found
sys.path.append(os.path.dirname(os.getcwd()))

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
# import src
import model_base as mb
import svr

## Data Exploration

* Load Data


In [3]:
df = mb.get_cleaned_datetime_df()

In [4]:
mb.set_start_date_time_index(df)
df.head()

| Start | Start_Timestamp | End_Timestamp | End | PM2.5-Pollutant | PM2.5-Value | PM2.5-Unit | PM2.5-Validity | PM2.5-Verification | PM10-Pollutant | PM10-Value | ... | O3-Pollutant | O3-Value | O3-Unit | O3-Validity | O3-Verification | SO2-Pollutant | SO2-Value | SO2-Unit | SO2-Validity | SO2-Verification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 00:00:00 | 1356998400 | 1357002000 | 2013-01-01 01:00:00 | 6001 | 71.04 | ug.m-3 | 1 | 1 | 5 | 88.96 | ... | 7 | 43.17 | ug.m-3 | 1 | 1 | 1 | 12.18 | ug.m-3 | 1 | 1 |
| 2013-01-01 01:00:00 | 1357002000 | 1357005600 | 2013-01-01 02:00:00 | 6001 | 20.52 | ug.m-3 | 1 | 1 | 5 | 25.17 | ... | 7 | 57.15 | ug.m-3 | 1 | 1 | 1 | 4.65 | ug.m-3 | 1 | 1 |
| 2013-01-01 02:00:00 | 1357005600 | 1357009200 | 2013-01-01 03:00:00 | 6001 | 9.56 | ug.m-3 | 1 | 1 | 5 | 11.97 | ... | 7 | 63.31 | ug.m-3 | 1 | 1 | 1 | 1.33 | ug.m-3 | 1 | 1 |
| 2013-01-01 03:00:00 | 1357009200 | 1357012800 | 2013-01-01 04:00:00 | 6001 | 9.45 | ug.m-3 | 1 | 1 | 5 | 11.73 | ... | 7 | 63.18 | ug.m-3 | 1 | 1 | 1 | 1.33 | ug.m-3 | 1 | 1 |
| 2013-01-01 04:00:00 | 1357012800 | 1357016400 | 2013-01-01 05:00:00 | 6001 | 13.02 | ug.m-3 | 1 | 1 | 5 | 15.88 | ... | 7 | 61.7 | ug.m-3 | 1 | 1 | 1 | 1.33 | ug.m-3 | 1 | 1 |


# Support Vector Regression (SVR) 



Support Vector Regression (SVR) applies the Support Vector Machine (SVM) algorithm to regression problems. SVM is a supervised learning algorithm commonly used for classification, but it can be adapted for regression, which gives the SVR model. SVR fits a function (a hyperplane in a possibly higher-dimensional feature space) so that as many training points as possible lie within a threshold distance, epsilon, of it. Errors inside this margin are ignored and only deviations beyond it are penalized, while the function itself is kept as flat as possible.

In SVR:

* The goal is to find a function that deviates from the observed targets y by at most epsilon for all the training data, and at the same time is as flat as possible.
* SVR uses the same principles as SVM for classification, with a few differences. Because the output is a real number rather than a class label, it can take infinitely many values and cannot be predicted exactly. Regression therefore sets a margin of tolerance (epsilon) around the fitted function, within which prediction errors are not penalized.
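The epsilon margin described above corresponds to the epsilon-insensitive loss, max(0, |y - f(x)| - epsilon). The following is a minimal sketch of that loss in NumPy; the function name and the sample values are illustrative assumptions, not part of the `svr` module or the DEBB021 data.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y - f(x)| - epsilon)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    # Residuals smaller than epsilon fall inside the tube and cost nothing
    return np.maximum(0.0, residual - epsilon)


# Illustrative values only (not taken from the DEBB021 data)
y_true = np.array([10.0, 12.0, 15.0])
y_pred = np.array([10.05, 12.5, 13.0])
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first prediction lies inside the margin and contributes no loss, while the other two are penalized only for the part of the error that exceeds epsilon.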

## Splitting Data 

Split the data into training, validation, and test sets.

In [5]:
train_data, validation_data, test_data = mb.split_data(df)

Training set size: 52588
Validation set size: 17529
Test set size: 17531
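The implementation of `mb.split_data` is not shown here. The set sizes above correspond to roughly 60/20/20 of the rows, and for time series a split is commonly made in chronological order so that the model is never evaluated on data older than its training data. A minimal sketch under those assumptions follows; the function name `chronological_split`, the fractions, and the toy data are hypothetical.

```python
import pandas as pd


def chronological_split(frame, train_frac=0.6, val_frac=0.2):
    """Split a time-indexed frame into train/validation/test without shuffling."""
    frame = frame.sort_index()
    n = len(frame)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return frame.iloc[:train_end], frame.iloc[train_end:val_end], frame.iloc[val_end:]


# Toy hourly series standing in for the real data (assumption)
toy = pd.DataFrame(
    {"PM2.5-Value": range(100)},
    index=pd.date_range("2013-01-01", periods=100, freq=pd.Timedelta(hours=1)),
)
toy_train, toy_val, toy_test = chronological_split(toy)
print(len(toy_train), len(toy_val), len(toy_test))
```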


In [6]:
# Extract the features
X_train, X_val, X_test = mb.extract_features(train_data, validation_data, test_data)
# Extract the target variable
y_train, y_val, y_test = mb.extract_target(train_data, validation_data, test_data)

## Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

The scaling step is important because PCA is sensitive to the variances of the initial variables. Therefore, scaling features before applying PCA ensures that each feature contributes equally to the analysis.
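Scaling, PCA, and SVR can be chained in a single scikit-learn `Pipeline`, so that the scaler and the PCA projection are fitted on the training data only. The sketch below is one way this might look; it is not the actual `svr.init_svr_pipeline`, whose contents are not shown, and the function name, the number of components, the SVR settings, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def build_svr_pipeline(n_components=2):
    """Hypothetical pipeline: standardize, project with PCA, then fit an SVR."""
    return Pipeline([
        ("scaler", StandardScaler()),             # zero mean, unit variance per feature
        ("pca", PCA(n_components=n_components)),  # keep the leading components
        ("svr", SVR(kernel="rbf", C=1.0, epsilon=0.1)),
    ])


# Synthetic data standing in for the pollutant features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=200)

demo_pipeline = build_svr_pipeline().fit(X_demo, y_demo)
demo_pred = demo_pipeline.predict(X_demo[:5])
print(demo_pred)
```

Because the steps live in one pipeline, calling `predict` on validation or test features reuses the scaling and projection learned during `fit`.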

## Model Creation
* Initialize SVR Model
* Train model

In [7]:
# Initialize and train the SVR pipeline
# pipeline = svr.init_svr_pipeline()
# pipeline.fit(X_train, y_train)

# Evaluation 

## With Validation Data

Error metrics: MAE, MSE, RMSE, MASE, MAPE

* MASE requires a baseline forecast for the time series. In the simplest case this is the naive forecast, which uses the last observed value to predict the next; more complex baselines such as one-step-ahead ARIMA forecasts are also possible. The cells below use the helper `mb.naive_mean_absolute_scaled_error` for this metric.
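The `mb` helpers that compute these metrics are not shown here. As a rough sketch of what the five metrics measure, the function below computes them with NumPy. The function name and sample values are assumptions, and MASE is simplified: it scales by the naive one-step forecast error on the evaluated series itself, whereas the usual definition takes that error from the training series.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, MAPE (in percent) and a naive-baseline MASE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    # Naive one-step baseline: predict each value with the previous observation
    naive_mae = np.mean(np.abs(np.diff(y_true)))
    return {
        "MAE": mae,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        # MAPE is undefined when y_true contains zeros
        "MAPE": np.mean(np.abs(errors / y_true)) * 100,
        "MASE": mae / naive_mae,
    }


# Illustrative values only
metrics = regression_metrics([10.0, 12.0, 11.0, 15.0], [11.0, 11.0, 12.0, 13.0])
print(metrics)
```

A MASE below 1 indicates that the model's average error is smaller than that of the naive forecast.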

### Predict Validation


In [8]:
# # Make predictions on the validation set
# y_val_pred = pipeline.predict(X_val)

# print(y_val_pred)

# # Error Metric
# mb.evolve_error_metrics(y_val,y_val_pred)
# mb.naive_mean_absolute_scaled_error(y_val,y_val_pred)

## With Test Data


In [9]:
# # Predict on the test set
# y_test_pred = pipeline.predict(X_test)

# # Error Metric
# mb.evolve_error_metrics(y_test,y_test_pred)
# mb.naive_mean_absolute_scaled_error(y_test,y_test_pred)

## Plot Table 


In [10]:
# mb.plot_pm_true_predict(validation_data, y_val_pred, 'Validation')
# mb.plot_pm_true_predict(test_data, y_test_pred, 'Test')

# Hyperparameter Tuning

SVR has several hyperparameters that strongly affect the fit. The most important are the kernel, the regularization parameter C, which controls how heavily deviations larger than epsilon are penalized relative to the flatness of the function, the width of the epsilon margin, and, for kernels such as RBF, the kernel coefficient gamma.

In [None]:
best_svr_estimater_model = svr.tune_and_evaluate_svr(df)

Training set size: 52588
Validation set size: 17529
Test set size: 17531
Started 2023-11-07 23:48:41
Fitted 2023-11-07 23:48:41
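The implementation of `svr.tune_and_evaluate_svr` is not shown here. A minimal sketch of how an SVR tuning step might look with scikit-learn follows, assuming a grid search over C, epsilon, and gamma with time-ordered cross-validation folds. The grid values, the variable names, and the synthetic data are assumptions and not the settings used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

tuning_pipeline = Pipeline([("scaler", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Hypothetical search grid; parameter names follow the "<step>__<param>" convention
param_grid = {
    "svr__C": [0.1, 1.0, 10.0],
    "svr__epsilon": [0.01, 0.1],
    "svr__gamma": ["scale", 0.1],
}

search = GridSearchCV(
    tuning_pipeline,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),     # each fold validates on later data than it trains on
    scoring="neg_mean_absolute_error",  # higher (closer to zero) is better
)

# Small synthetic data set so the search runs quickly (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

search.fit(X_demo, y_demo)
best_params = search.best_params_
print(best_params, search.best_score_)
```

Kernel SVR training time grows faster than quadratically with the number of samples, so a search over the full training set of about 52,000 rows may take a long time; tuning on a subsample is one possible way to keep it manageable.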
