## Linear Regression and SVM Modeling for Interval Target (SAS Viya)

**EXAMPLE:** Linear and Support Vector Based Modeling for Interval Target using Python & SAS Viya  
**DATA SOURCE:**  
Data: bike_sharing_demand.csv   
Fanaee-T, H. (2013). Bike Sharing Dataset. UCI Machine Learning Repository. [Link](https://doi.org/10.24432/C5W894) 

**DESCRIPTION:** This template demonstrates a workflow for building predictive models in Python using non-tree-based modeling techniques such as Linear Regression and Support Vector Machines (SVM).  
**PURPOSE:** The goal is to predict the count of bikes rented per hour using various predictor variables, such as weather, season, temperature, hour, month, and weekday.  
**DETAILS:**  
- Models built include: Linear Regression, Support Vector Machines (SVM) & Ensemble
- Preprocessing and Scoring the validation and test data
- Model Assessment & Model Comparison: Mean Squared Error


In [None]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split
from sasviya.ml.linear_model import LinearRegression
from sasviya.ml.svm import SVR
import matplotlib.pyplot as plt

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

### Data Loading and Preprocessing
**Importing Data and Defining Variables**
- Load the dataset and split it into training, validation, and test partitions.
- Define variables necessary for further analysis
- Outlier Treatment
- Feature Selection to identify the most relevant features for prediction.


In [None]:
# Construct the workspace path
workspace = f"{os.path.abspath('')}/../../data/"

# Import Data and Define Variables
data = pd.read_csv(workspace + "bike_sharing_demand.csv")

# Split the data into Train, Validation, and Test sets (40% Train, 30% Validation, 30% Test)
train_data, temp_test_data = train_test_split(data, test_size=0.6, random_state=42)
val_data, test_data = train_test_split(temp_test_data, test_size=0.5, random_state=42)

# Create X and y variables for modeling
X_train, y_train = train_data.drop(columns=['count']), train_data['count']
X_val, y_val = val_data.drop(columns=['count']), val_data['count']
X_test, y_test = test_data.drop(columns=['count']), test_data['count']

# Print first 5 rows of train dataset
print("Top 5 rows of bikesharing train dataset:")
print(train_data.head(5))
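
The two-step split above can be sanity-checked with a small sketch. It assumes a synthetic 1,000-row index in place of the bike-sharing data; the names `n_rows`, `rows`, `train_idx`, `temp_idx`, `val_idx`, and `test_idx` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataset: 1,000 row indices
n_rows = 1000
rows = np.arange(n_rows)

# Step 1: hold out 60% of the rows, leaving 40% for training
train_idx, temp_idx = train_test_split(rows, test_size=0.6, random_state=42)

# Step 2: split the held-out 60% in half, giving 30% validation and 30% test
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

fractions = {
    "train": len(train_idx) / n_rows,
    "validation": len(val_idx) / n_rows,
    "test": len(test_idx) / n_rows,
}
print(fractions)
```

Splitting in two passes is needed because `train_test_split` produces only two partitions per call; the second call's `test_size=0.5` applies to the held-out 60%, not to the full dataset.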

**Treat Outliers**  
The target variable "count" is highly skewed. To address this, a logarithmic transformation, log(count + 1), is applied to the target in every partition.

In [None]:
# Perform log transformation on 'count' variable
y_train = np.log1p(train_data['count'])
y_val = np.log1p(val_data['count'])
y_test = np.log1p(test_data['count'])

fig, axes = plt.subplots(1, 2, figsize=(18, 6))
# Plot histogram of original 'count' variable
axes[0].hist(train_data['count'], bins=30, alpha=0.5, color='blue', label='Original')
axes[0].set_xlabel('Count')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Original Target "count" variable (Train)')
axes[0].legend()

# Plot histogram of log-transformed 'count' variable
axes[1].hist(y_train, bins=30, alpha=0.5, color='green', label='Log Transformed')
axes[1].set_xlabel('Log(Count + 1)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Log Transformed Target "count" variable (Train)')
axes[1].legend()
plt.tight_layout()
plt.show()
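
Because the models are trained on log(count + 1), their predictions are on the log scale as well. The sketch below shows how `np.expm1` reverses `np.log1p`, which is one way to map predictions back to rental counts. The count values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Illustrative hourly counts (made-up values, not taken from the dataset)
counts = np.array([0, 1, 16, 120, 977])

# Forward transform used for the target: log(count + 1)
log_counts = np.log1p(counts)

# Inverse transform: exp(value) - 1 maps log-scale values back to counts
recovered = np.expm1(log_counts)

print(np.round(log_counts, 3))
print(np.round(recovered, 3))
```

Using `log1p` rather than a plain logarithm keeps hours with zero rentals well defined, since log(0 + 1) is 0.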

**Feature Selection**  
&emsp; Identify the most relevant features for prediction.  
&emsp; Feature selection is performed solely on the training set, and the same selected features are used across other partitions to prevent data leakage.   
&emsp; The SelectKBest technique is used to select the top k features (here, k = 5) based on univariate statistical tests (`f_regression`).

In [None]:
# Exclude 'date' from X_train
X_train_subset = X_train.drop(columns=['date'])

# Perform feature selection using SelectKBest technique that selects the top k features based on univariate statistical tests
selector = SelectKBest(score_func=f_regression, k=5)  # Select top 5 features
X_train_selected = selector.fit_transform(X_train_subset, y_train)

# Get selected feature names
selected_features = X_train_subset.columns[selector.get_support()]

# Print selected feature names
print("Selected Features:", selected_features)

# Subset all partitions using selected features
# (column selection keeps each partition's original row index, so X and y stay aligned)
X_train_selected = X_train[selected_features]
X_val_selected = X_val[selected_features]
X_test_selected = X_test[selected_features]
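
To see why particular features are selected, it can help to inspect the univariate scores behind the selection. The following is a minimal sketch on synthetic data, assuming hypothetical column names `x0` to `x3` where only `x0` and `x2` influence the target; none of these names come from the bike-sharing data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only 'x0' and 'x2' drive the target (hypothetical column names)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x0", "x1", "x2", "x3"])
y_demo = 3 * X_demo["x0"] - 2 * X_demo["x2"] + rng.normal(scale=0.1, size=200)

# Fit the selector on the (training) data only
demo_selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# Pair each column with its F-statistic and sort from most to least relevant
demo_scores = pd.Series(demo_selector.scores_, index=X_demo.columns).sort_values(ascending=False)
demo_selected = list(X_demo.columns[demo_selector.get_support()])

print(demo_scores)
print("Selected:", demo_selected)
```

Note that `f_regression` scores each feature on its own, so it captures linear association with the target but not interactions between features.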


### Linear Regression Model Training, Scoring and Evaluation
For more information regarding SAS Viya Linear Regression, refer to [this link](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=p0kx8n36nycmj0n1h1o8d3tqfxc3.htm).



In [None]:
# Initialize Linear Regression model
sas_lr = LinearRegression()

# Fit the model
sas_lr.fit(X_train_selected, y_train)

# Make predictions on training data
y_train_pred = sas_lr.predict(X_train_selected)

# Calculate MSE on training partition
mse_train = mean_squared_error(y_train, y_train_pred)
print(f"Training Mean Squared Error (Linear Regression): {mse_train:.3f}")

# Make predictions on validation data
y_val_pred = sas_lr.predict(X_val_selected)

# Calculate MSE on validation partition
mse_val = mean_squared_error(y_val, y_val_pred)
print(f"Validation Mean Squared Error (Linear Regression): {mse_val:.3f}")

# Make predictions on test data
y_test_pred = sas_lr.predict(X_test_selected)

# Calculate MSE on test partition
mse_test = mean_squared_error(y_test, y_test_pred)
print(f"Test Mean Squared Error (Linear Regression): {mse_test:.3f}")
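
The predict-then-score pattern above repeats for each partition, so it could be wrapped in a small helper. The sketch below is one possible way to do that; `score_partitions` is a hypothetical helper, the data is synthetic, and scikit-learn's `LinearRegression` is used only as a stand-in for the SAS Viya class, on the assumption that both expose `fit` and `predict` as used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in, not the SAS Viya class
from sklearn.metrics import mean_squared_error


def score_partitions(model, partitions):
    """Return {partition name: MSE} for an already fitted model (hypothetical helper)."""
    return {
        name: mean_squared_error(y, model.predict(X))
        for name, (X, y) in partitions.items()
    }


# Synthetic data for illustration only
rng = np.random.default_rng(1)
X_all = rng.normal(size=(300, 3))
y_all = X_all @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.2, size=300)

demo_partitions = {
    "train": (X_all[:120], y_all[:120]),
    "validation": (X_all[120:210], y_all[120:210]),
    "test": (X_all[210:], y_all[210:]),
}

# Fit on the training partition only, then score every partition
demo_model = LinearRegression().fit(*demo_partitions["train"])
demo_mse = score_partitions(demo_model, demo_partitions)

for name, value in demo_mse.items():
    print(f"{name} MSE: {value:.3f}")
```

The same helper could then be reused for the SVM and ensemble models, which keeps the three evaluations consistent.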

### SVM Model Training, Scoring and Evaluation
NOTE: An SVM with an interval target is often referred to as Support Vector Regression (SVR).  
For more information regarding SAS Viya SVR, refer to [this link](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=p14qlscxhb7i70n196xmpynf7lay.htm).

In [None]:
# Initialize SVM model
sas_svm_model = SVR()

# Fit the SVM model
sas_svm_model.fit(X_train_selected, y_train)

# Make predictions on training data
y_train_pred_svm = sas_svm_model.predict(X_train_selected)

# Calculate MSE on training partition
mse_train_svm = mean_squared_error(y_train, y_train_pred_svm)
print(f"Training Mean Squared Error (SVM): {mse_train_svm:.3f}")

# Make predictions on validation data
y_val_pred_svm = sas_svm_model.predict(X_val_selected)

# Calculate MSE on validation partition
mse_val_svm = mean_squared_error(y_val, y_val_pred_svm)
print(f"Validation Mean Squared Error (SVM): {mse_val_svm:.3f}")

# Make predictions on test data
y_test_pred_svm = sas_svm_model.predict(X_test_selected)

# Calculate MSE on test partition
mse_test_svm = mean_squared_error(y_test, y_test_pred_svm)
print(f"Test Mean Squared Error (SVM): {mse_test_svm:.3f}")

### Ensemble Model Training, Scoring, and Evaluation
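Combine the Linear Regression and SVM models using simple averaging: `VotingRegressor` refits a copy of each model on the training data and averages their predictions with equal weights.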

In [None]:
# Ensemble model combining Linear Regression and SVM predictions
ensemble_model = VotingRegressor([('lr', sas_lr), ('svm', sas_svm_model)])
ensemble_model.fit(X_train_selected, y_train)

# Make predictions on training, validation, and test data for the ensemble model
y_train_pred_ensemble = ensemble_model.predict(X_train_selected)
y_val_pred_ensemble = ensemble_model.predict(X_val_selected)
y_test_pred_ensemble = ensemble_model.predict(X_test_selected)

# Calculate MSE on training partition for the ensemble model
mse_train_ensemble = mean_squared_error(y_train, y_train_pred_ensemble)
print(f"Training Mean Squared Error (Ensemble): {mse_train_ensemble:.3f}")

# Calculate MSE on validation partition for the ensemble model
mse_val_ensemble = mean_squared_error(y_val, y_val_pred_ensemble)
print(f"Validation Mean Squared Error (Ensemble): {mse_val_ensemble:.3f}")

# Calculate MSE on test partition for the ensemble model
mse_test_ensemble = mean_squared_error(y_test, y_test_pred_ensemble)
print(f"Test Mean Squared Error (Ensemble): {mse_test_ensemble:.3f}")
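
The "simple averaging" performed by `VotingRegressor` can be illustrated with a short sketch. It uses scikit-learn's own `LinearRegression` and `SVR` as stand-ins for the SAS Viya estimators and runs on synthetic data, so it demonstrates the averaging behavior rather than reproducing the notebook's results.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression  # stand-in for the SAS Viya model
from sklearn.svm import SVR  # stand-in for the SAS Viya model

# Synthetic data for illustration only
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(150, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=150)

# Fit each base model separately
lr_demo = LinearRegression().fit(X_demo, y_demo)
svr_demo = SVR().fit(X_demo, y_demo)

# VotingRegressor fits its own copies of the base models on the same data
voting_demo = VotingRegressor([("lr", lr_demo), ("svr", svr_demo)]).fit(X_demo, y_demo)

# With default (equal) weights, the ensemble prediction is the plain mean
manual_average = (lr_demo.predict(X_demo) + svr_demo.predict(X_demo)) / 2
print(np.allclose(voting_demo.predict(X_demo), manual_average))
```

Unequal contributions are possible through the `weights` parameter of `VotingRegressor`, which this notebook leaves at its default.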

### Overall Model Assessment
Examine the distribution of validation residuals to assess model assumptions and identify any bias or variance issues.

In [None]:
# Calculate residuals for Linear Regression, SVM, and Ensemble models on the validation data
residuals_lr = y_val - y_val_pred
residuals_svm = y_val - y_val_pred_svm
residuals_ensemble = y_val - y_val_pred_ensemble

# Plot the distribution of residuals for all three models
plt.figure(figsize=(12, 8))

# Plot residuals for Linear Regression model
sns.histplot(residuals_lr, kde=True, color='red', label='Linear Regression Residuals')

# Plot residuals for SVM model
sns.histplot(residuals_svm, kde=True, color='green', label='SVM Residuals')

# Plot residuals for Ensemble model
sns.histplot(residuals_ensemble, kde=True, color='blue', label='Ensemble Residuals')

# Set labels and title
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals: Linear Regression vs. SVM vs. Ensemble')
plt.legend()
plt.grid(True)
plt.show()
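
A numeric summary can complement the histogram: a mean residual far from zero points to systematic bias, while the standard deviation reflects the spread of the errors. The sketch below uses made-up log-scale values, and the variable names are illustrative only.

```python
import numpy as np

# Made-up log-scale targets and predictions, for illustration only
y_true_demo = np.array([2.0, 3.5, 4.1, 5.0, 5.6])
y_pred_demo = np.array([2.3, 3.4, 4.0, 4.6, 5.9])

residuals_demo = y_true_demo - y_pred_demo

# Mean residual: systematic over- or under-prediction (bias)
mean_residual = residuals_demo.mean()
# Sample standard deviation: spread of the errors around that mean
std_residual = residuals_demo.std(ddof=1)
# Mean squared error computed directly from the residuals
mse_demo = np.mean(residuals_demo ** 2)

print(f"mean={mean_residual:.3f}, std={std_residual:.3f}, mse={mse_demo:.3f}")
```

The MSE equals the squared mean residual plus the population variance of the residuals, which is why both bias and spread matter when comparing models on MSE.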

### Overall Model Comparison
Compare Mean Squared Error (MSE) across the models and partitions.

In [None]:
# Define models and corresponding MSE values
models = ['Linear Regression', 'SVM', 'Ensemble']
mse_train_values = [mse_train, mse_train_svm, mse_train_ensemble]
mse_val_values = [mse_val, mse_val_svm, mse_val_ensemble]
mse_test_values = [mse_test, mse_test_svm, mse_test_ensemble]

# Plot MSE for all models across training, validation, and test partitions
plt.figure(figsize=(10, 6))
plt.plot(models, mse_train_values, marker='o', label='Training MSE', color='blue')
plt.plot(models, mse_val_values, marker='s', label='Validation MSE', color='green')
plt.plot(models, mse_test_values, marker='x', label='Test MSE', color='red')

plt.xlabel('Model')
plt.ylabel('Mean Squared Error')
plt.title('Comparison of Mean Squared Error (MSE) for Linear Regression, SVM, and Ensemble')
plt.legend()
plt.grid(True)
plt.show()


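Linear Regression is the champion model because it has the lowest MSE. Note that all MSE values here are computed on the log-transformed target, log(count + 1), not on raw rental counts.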
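
Champion selection can also be done programmatically. A common convention, assumed here, is to choose by validation MSE and keep the test partition for a final unbiased check. The numbers below are placeholders for illustration only and are not results from this notebook.

```python
# Placeholder validation MSE values, for illustration only (not results from this notebook)
validation_mse = {
    "Linear Regression": 0.52,
    "SVM": 0.61,
    "Ensemble": 0.55,
}

# Pick the model with the smallest validation error
champion = min(validation_mse, key=validation_mse.get)
print(f"Champion by validation MSE: {champion}")
```

In the notebook itself, the dictionary could be built from `models` and `mse_val_values` instead of hard-coded numbers.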