# Regression Modelling

## Analytical Assumption

We assume that **energy inefficiency is not random** across campus buildings.

### Hypothesis
> A building's operational characteristics (hour of day, utility type, building size, and reading frequency) can predict its `energy_per_sqm`.

If this assumption is correct, a regression model should explain a substantial share of the variance in `energy_per_sqm` (measured by R²) and should predict it more accurately than a constant guess.

---

## Why Regression?

Traditional analysis:
- Shows historical inefficiency trends.

Regression modelling:
- Explains *drivers* of inefficiency.
- Predicts future performance.
- Supports proactive energy optimization.

### Goal
Predict **`energy_per_sqm`** and compare model performance against a simple baseline assumption.

---

## Inputs (from Notebook 03)

- `workspace.default.daily_efficiency`
- `workspace.default.hourly_efficiency`
- `workspace.default.priority_buildings`

---

## Outputs (from this notebook)

- `workspace.default.daily_efficiency_predictions`  
  *(Delta table used for dashboarding and advanced analysis)*

### Model Evaluation Metrics
- RMSE (Root Mean Squared Error)
- R² (Coefficient of Determination)
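
As a reference for how these two metrics are usually defined, here is a minimal plain-Python sketch. The function names and the sample values are hypothetical and only illustrate the formulas; the notebook itself relies on Spark's `RegressionEvaluator`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical energy_per_sqm values and predictions (illustration only)
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print("RMSE:", rmse(actual, predicted))
print("R2:", r2(actual, predicted))
```

Under these definitions, predicting the mean of `actual` for every row gives R² = 0, which is why R² reads as improvement over a constant guess.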

---

## Overview

This notebook performs:

- Feature engineering from efficiency tables
- Regression model training
- Model evaluation
- Saving predictions for visualization and dashboards


### **Load Tables**

In [0]:
from pyspark.sql import functions as F

daily_eff = spark.table("workspace.default.daily_efficiency")
hourly_eff = spark.table("workspace.default.hourly_efficiency")
priority_buildings = spark.table("workspace.default.priority_buildings")

display(daily_eff.limit(5))


## **Assumption Baseline**

In [0]:
baseline = (
    daily_eff
    .select(F.avg("energy_per_sqm").alias("baseline"))
    .collect()[0]["baseline"]
)

daily_eff = daily_eff.withColumn(
    "baseline_prediction",
    F.lit(baseline)
)


## Baseline Model
Before fitting a regression, we assume `energy_per_sqm` is constant across campus and predict the overall mean for every row.

This naive assumption is the benchmark that the regression model must beat.
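
To compare the baseline with a model that is scored on held-out rows, one option is to take the mean from the training rows only and score it on the test rows. The sketch below shows that idea in plain Python; `mean_baseline_rmse` and the sample values are hypothetical, not part of the notebook.

```python
import math

def mean_baseline_rmse(train_labels, test_labels):
    """Predict the training mean for every test row and return (mean, RMSE)."""
    baseline = sum(train_labels) / len(train_labels)
    squared = [(y - baseline) ** 2 for y in test_labels]
    return baseline, math.sqrt(sum(squared) / len(squared))

# Hypothetical energy_per_sqm labels (illustration only)
train_labels = [2.0, 4.0, 6.0]
test_labels = [3.0, 5.0]

baseline, baseline_rmse = mean_baseline_rmse(train_labels, test_labels)
print("baseline:", baseline, "baseline RMSE:", baseline_rmse)
```

Taking the mean from training rows only keeps the benchmark from seeing the test labels, so both numbers are measured on the same unseen data.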


## FEATURE ENGINEERING (FOR REGRESSION)

In [0]:
from pyspark.sql.functions import dayofweek  # `hour` was imported but never used

df = (
    daily_eff
    .withColumn("day_of_week", dayofweek("day"))
    .withColumn("is_priority",
                F.when(F.col("sitename").isin(
                    [r.sitename for r in priority_buildings.select("sitename").collect()]
                ), 1).otherwise(0)
    )
)


## **BUILD REGRESSION MODEL**

In [0]:
# ============================================================
# BUILD REGRESSION MODEL (KEEP METADATA FOR DASHBOARD)
# Goal:
# Predict energy_per_sqm using engineered efficiency features
# AND retain sitename/day/utility for dashboarding.
# ============================================================

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import functions as F

# =========================================
# 1) LOAD INPUT TABLES
# =========================================
df = spark.table("workspace.default.daily_efficiency")

priority = (
    spark.table("workspace.default.priority_buildings")
         .select("sitename")
         .distinct()
         .withColumn("is_priority", F.lit(1))
)

# =========================================
# 2) FEATURE ENGINEERING (CREATE COLUMNS FIRST)
# =========================================

# 2a) day_of_week from day (0=Mon ... 6=Sun)
df = df.withColumn("day_of_week", F.dayofweek("day"))          # 1=Sun ... 7=Sat
df = df.withColumn("day_of_week", (F.col("day_of_week") + 5) % 7)  # 0=Mon ... 6=Sun

# 2b) is_priority from priority table (left join)
df = (
    df.join(priority, on="sitename", how="left")
      .fillna({"is_priority": 0})
)

# =========================================
# 3) CAST FEATURES/LABEL TO NUMERIC
# =========================================
df = (
    df.withColumn("square_meters", F.col("square_meters").cast("double"))
      .withColumn("n_readings", F.col("n_readings").cast("double"))
      .withColumn("day_of_week", F.col("day_of_week").cast("double"))
      .withColumn("is_priority", F.col("is_priority").cast("double"))
      .withColumn("energy_per_sqm", F.col("energy_per_sqm").cast("double"))
)

# =========================================
# 4) CLEAN NULLS
# =========================================
feature_cols = ["square_meters", "n_readings", "day_of_week", "is_priority"]
df = df.dropna(subset=feature_cols + ["energy_per_sqm"])

# =========================================
# 5) ASSEMBLE FEATURES (KEEP METADATA)
# =========================================
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_feat = assembler.transform(df)

meta_cols = ["sitename", "day", "utility"]

model_df = df_feat.select(
    *meta_cols,
    *feature_cols,
    "features",
    "energy_per_sqm"
)

# =========================================
# 6) TRAIN / TEST SPLIT
# =========================================
train, test = model_df.randomSplit([0.8, 0.2], seed=42)

# =========================================
# 7) TRAIN MODEL
# =========================================
lr = LinearRegression(featuresCol="features", labelCol="energy_per_sqm")
model = lr.fit(train)

# =========================================
# 8) PREDICT (METADATA INCLUDED)
# =========================================
predictions = model.transform(test)

display(
    predictions.select(
        "sitename", "day", "utility",
        "energy_per_sqm", "prediction",
        "square_meters", "n_readings", "day_of_week", "is_priority"
    ).limit(10)
)
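
The `(day_of_week + 5) % 7` step in section 2a can be checked without Spark. The sketch below mimics Spark's `dayofweek` convention (1=Sunday ... 7=Saturday) with a hypothetical helper and confirms that the remapped value matches Python's `date.weekday()` (0=Monday ... 6=Sunday).

```python
from datetime import date, timedelta

def spark_dayofweek(d):
    """Hypothetical stand-in for Spark's dayofweek: 1=Sunday ... 7=Saturday."""
    return (d.isoweekday() % 7) + 1

def monday_zero(spark_dow):
    """Remap 1=Sun..7=Sat to 0=Mon..6=Sun, as the notebook does."""
    return (spark_dow + 5) % 7

# Check seven consecutive days so every weekday is covered once
start = date(2024, 1, 1)
week = [start + timedelta(days=i) for i in range(7)]
matches = [monday_zero(spark_dayofweek(d)) == d.weekday() for d in week]
print(all(matches))
```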


## **EVALUATE MODEL (PROVE ASSUMPTION)**

In [0]:

# EVALUATE REGRESSION MODEL (PROVE ASSUMPTION)
# Goal:
# Measure how well the regression model predicts energy_per_sqm
# using standard regression metrics:
#   - RMSE (Root Mean Squared Error)
#   - R² (Coefficient of Determination)


# Import evaluator from Spark ML
from pyspark.ml.evaluation import RegressionEvaluator


# STEP 1 — Create RMSE evaluator
# RMSE is the square root of the mean squared error, in the same units as
# energy_per_sqm; large errors weigh more than small ones
# Lower RMSE = better model

rmse_eval = RegressionEvaluator(
    labelCol="energy_per_sqm",      # actual values
    predictionCol="prediction",     # model output
    metricName="rmse"               # evaluation metric
)


# STEP 2 — Create R² evaluator
# R² is the share of variance in energy_per_sqm that the model explains
# Higher R² = better model

r2_eval = RegressionEvaluator(
    labelCol="energy_per_sqm",
    predictionCol="prediction",
    metricName="r2"
)


# STEP 3 — Calculate evaluation metrics

rmse = rmse_eval.evaluate(predictions)
r2 = r2_eval.evaluate(predictions)


# STEP 4 — Print model performance results

print("RMSE:", rmse)
print("R2:", r2)


## **Interpretation**

If R² is high:
✔ The result supports our assumption: inefficiency follows predictable patterns.

If the model's RMSE is lower than the baseline's RMSE:
✔ Regression improves prediction beyond the simple campus-wide average.
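
One way to put a number on "lower than baseline" is the relative reduction in RMSE. The helper below is a hypothetical sketch, and the two RMSE values are placeholders rather than results from this notebook.

```python
def rmse_improvement(model_rmse, baseline_rmse):
    """Fraction by which the model reduces RMSE relative to the baseline.

    Positive means the model beats the baseline; zero or negative means it does not.
    """
    if baseline_rmse <= 0:
        raise ValueError("baseline_rmse must be positive")
    return 1.0 - model_rmse / baseline_rmse

# Placeholder values (illustration only, not measured results)
improvement = rmse_improvement(model_rmse=0.8, baseline_rmse=1.0)
print(f"RMSE reduction vs. baseline: {improvement:.0%}")
```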


## **Write predictions WITH metadata**

In [0]:
from pyspark.sql import functions as F

pred_out = (
    predictions
      .select(
          "sitename", "day", "utility",
          "energy_per_sqm", "prediction",
          "square_meters", "n_readings", "day_of_week", "is_priority"
      )
      .withColumn("error", F.col("energy_per_sqm") - F.col("prediction"))
      .withColumn("abs_error", F.abs(F.col("error")))
      .withColumn("squared_error", F.col("error") * F.col("error"))
)

(
  pred_out.write
    .mode("overwrite")
    .option("overwriteSchema", "true")   # key line
    .format("delta")
    .saveAsTable("workspace.default.daily_efficiency_predictions")
)

display(pred_out.limit(10))


## **SAVE OUTPUT FOR DASHBOARD**

In [0]:
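# Write pred_out rather than the raw predictions DataFrame: predictions lacks the
# error columns and carries the "features" vector column, so its schema differs
# from the table written in the previous cell.
(pred_out.write
 .mode("overwrite")
 .option("overwriteSchema", "true")
 .format("delta")
 .saveAsTable("workspace.default.daily_efficiency_predictions"))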
