# 06 — Dashboard (Regression Only)

**Goal:** Create dashboard-ready outputs for regression predictions.

**Inputs:**
- `workspace.default.daily_efficiency_predictions` (from Notebook 04)

**Outputs (this notebook):**
- `workspace.default.regression_dashboard_building_summary`
- `workspace.default.regression_dashboard_timeseries`
- `workspace.default.regression_dashboard_error_distribution`

**Dashboard tiles:**
- Actual vs Predicted trend (time)
- Worst buildings by error
- Error distribution
- Priority vs Non-priority comparison


## **Load predictions + sanity check**

In [0]:
from pyspark.sql import functions as F

pred = spark.table("workspace.default.daily_efficiency_predictions")
display(pred.limit(10))

pred.printSchema()
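
The later cells rely on the columns `sitename`, `day`, `utility`, `energy_per_sqm`, and `prediction`. A small check like the one below could make a missing column fail early instead of partway through an aggregation. This is only a sketch: the helper name `missing_columns` and the example column list are hypothetical, and in the notebook the input would presumably be `pred.columns`.

```python
# Columns referenced by the cells further down in this notebook
REQUIRED_COLUMNS = ["sitename", "day", "utility", "energy_per_sqm", "prediction"]

def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    available_set = set(available)
    return [c for c in required if c not in available_set]

# Hypothetical column list standing in for pred.columns
example_columns = ["sitename", "day", "energy_per_sqm", "prediction"]
print(missing_columns(example_columns))
```

In the notebook, a non-empty result from `missing_columns(pred.columns)` would be a reason to stop and revisit Notebook 04 before writing any dashboard tables.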


## **Create clean “error” columns**

In [0]:
# ============================================================
# CREATE MODEL ERROR COLUMNS
# ============================================================
# Goal:
# Compare actual vs predicted energy_per_sqm values.
# These columns help evaluate regression performance and
# will later be used inside dashboard metrics.
#
# Columns created:
# - error         = actual - prediction
# - abs_error     = absolute error (used for MAE)
# - squared_error = squared error (used for RMSE)
# ============================================================

from pyspark.sql import functions as F

pred2 = (
    pred
    # Difference between actual and predicted values
    .withColumn("error", F.col("energy_per_sqm") - F.col("prediction"))

    # Absolute error (how far prediction is regardless of sign)
    .withColumn("abs_error", F.abs(F.col("error")))

    # Squared error (penalizes large mistakes more strongly)
    .withColumn("squared_error", F.col("error") * F.col("error"))
)

# Preview prediction results with error metrics
display(
    pred2.select(
        "sitename",
        "day",
        "utility",
        "energy_per_sqm",
        "prediction",
        "error",
        "abs_error"
    ).limit(20)
)
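
The three error columns are plain row-wise arithmetic. As an illustration outside Spark, the sketch below applies the same formulas to a few made-up (actual, prediction) pairs. The values and the helper name `error_columns` are hypothetical and do not come from the predictions table.

```python
def error_columns(actual, prediction):
    """Return (error, abs_error, squared_error) for one row, mirroring the Spark columns."""
    error = actual - prediction          # positive when the model under-predicts
    return error, abs(error), error * error

# Hypothetical rows: (energy_per_sqm, prediction)
rows = [(10.0, 8.0), (5.0, 7.0), (3.0, 3.0)]
results = [error_columns(a, p) for a, p in rows]

for (a, p), (e, ae, se) in zip(rows, results):
    print(f"actual={a} predicted={p} error={e} abs_error={ae} squared_error={se}")
```

The first two rows have opposite signs in `error` but identical `abs_error` and `squared_error`, which is why the aggregates below use the latter two rather than the signed error.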


## **Building-level summary table**

In [0]:
# ============================================================
# BUILDING-LEVEL REGRESSION SUMMARY TABLE
# ============================================================
# Goal:
# Aggregate prediction performance at building level.
#
# Why this is important:
# Dashboards usually show building performance, not raw rows.
# This table gives quick KPIs:
#
# - n_rows        : number of prediction rows for the building
# - avg_actual    : average actual energy_per_sqm
# - avg_predicted : average predicted energy_per_sqm
# - mae           : mean absolute error
# - rmse          : root mean squared error
# ============================================================

from pyspark.sql import functions as F

building_summary = (
    pred2.groupBy("sitename")
    .agg(
        # Number of records per building
        F.count("*").alias("n_rows"),

        # Average actual energy consumption
        F.avg("energy_per_sqm").alias("avg_actual"),

        # Average predicted energy consumption
        F.avg("prediction").alias("avg_predicted"),

        # Mean Absolute Error (easy-to-understand accuracy metric)
        F.avg("abs_error").alias("mae"),

        # Root Mean Squared Error (penalizes large errors)
        F.sqrt(F.avg("squared_error")).alias("rmse")
    )
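    # Sort buildings by worst prediction performance (highest RMSE first).
    # Row order is not guaranteed to survive the Delta write, so the
    # dashboard query should apply its own ORDER BY on rmse.
    .orderBy(F.col("rmse").desc())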
)

# ------------------------------------------------------------
# SAVE AS DELTA TABLE FOR THE DASHBOARD
# ------------------------------------------------------------
# This table backs the "Worst buildings by error" tile.
# The dashboard reads it directly.
# ------------------------------------------------------------
building_summary.write.mode("overwrite") \
    .format("delta") \
    .saveAsTable("workspace.default.regression_dashboard_building_summary")

# Preview output
display(building_summary.limit(30))
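
To see why the summary keeps both `mae` and `rmse`, the plain-Python sketch below computes them per building for a handful of made-up absolute errors. The building names and values are hypothetical. It mirrors the `F.avg("abs_error")` and `F.sqrt(F.avg("squared_error"))` aggregations above, without Spark.

```python
import math
from collections import defaultdict

# Hypothetical rows: (sitename, abs_error)
rows = [("A", 1.0), ("A", 1.0), ("A", 4.0),
        ("B", 2.0), ("B", 2.0), ("B", 2.0)]

# Group absolute errors by building, like groupBy("sitename")
by_site = defaultdict(list)
for site, abs_err in rows:
    by_site[site].append(abs_err)

def mae(errors):
    """Mean absolute error."""
    return sum(errors) / len(errors)

def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

summary = {site: (mae(errs), rmse(errs)) for site, errs in by_site.items()}
for site, (m, r) in summary.items():
    print(f"{site}: mae={m:.3f} rmse={r:.3f}")
```

Both buildings have the same MAE, but building A has the higher RMSE because of its single large miss. Sorting by `rmse` therefore surfaces buildings with occasional large errors, not only those that are consistently off.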


## **Time series table (dashboard tile: “Actual vs Predicted over time”)**

In [0]:
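# Daily averages over all rows for the "Actual vs Predicted" trend tile
timeseries = (
    pred2.groupBy("day")
    .agg(
        # Mean actual and predicted energy_per_sqm for the day
        F.avg("energy_per_sqm").alias("avg_actual"),
        F.avg("prediction").alias("avg_predicted"),
        # Daily MAE: mean absolute error over that day's rows
        F.avg("abs_error").alias("avg_abs_error")
    )
    # Sort by date for the preview below
    .orderBy("day")
)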

timeseries.write.mode("overwrite").format("delta").saveAsTable(
    "workspace.default.regression_dashboard_timeseries"
)

display(timeseries)
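
One point worth keeping in mind when reading the trend tile: `avg_actual` and `avg_predicted` can track each other closely even when individual predictions are far off, because over- and under-predictions cancel in the averages. The sketch below shows this with two made-up rows for a single day. The values are hypothetical.

```python
# Hypothetical rows for a single day: (energy_per_sqm, prediction)
day_rows = [(10.0, 14.0), (10.0, 6.0)]

n = len(day_rows)
avg_actual = sum(a for a, _ in day_rows) / n
avg_predicted = sum(p for _, p in day_rows) / n
avg_abs_error = sum(abs(a - p) for a, p in day_rows) / n

# Gap between the two trend lines on that day
gap_between_averages = abs(avg_actual - avg_predicted)

print(f"avg_actual={avg_actual} avg_predicted={avg_predicted}")
print(f"gap_between_averages={gap_between_averages} avg_abs_error={avg_abs_error}")
```

Here the two lines would overlap exactly while the daily mean absolute error is 4.0, which is why the table also carries `avg_abs_error`.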


## **Error distribution table**

In [0]:
# Bucket absolute errors into bins of width 10, labelled by their lower edge:
# 0 covers [0, 10), 10 covers [10, 20), and so on.
error_dist = (
    pred2.withColumn("abs_error_bucket", (F.col("abs_error") / 10).cast("int") * 10)
    .groupBy("abs_error_bucket")
    .agg(F.count("*").alias("count"))
    .orderBy("abs_error_bucket")
)

error_dist.write.mode("overwrite").format("delta").saveAsTable(
    "workspace.default.regression_dashboard_error_distribution"
)

display(error_dist)
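
The bucketing expression divides by the bin width, truncates to an integer, and multiplies back, which gives the lower edge of the bin. The plain-Python sketch below reproduces that arithmetic for a few made-up absolute errors. The helper name `abs_error_bucket` and the sample values are hypothetical, and the `width` parameter is an illustrative generalization of the fixed 10 used above.

```python
def abs_error_bucket(abs_error, width=10):
    """Lower edge of the bin containing abs_error (non-negative input assumed)."""
    return int(abs_error / width) * width

# Hypothetical absolute errors
samples = [0.0, 9.99, 10.0, 25.3, 100.0]
buckets = [abs_error_bucket(x) for x in samples]

for x, b in zip(samples, buckets):
    print(f"abs_error={x} -> bucket {b}")
```

A value of exactly 10.0 lands in bucket 10, not bucket 0, so each bin includes its lower edge and excludes its upper edge. If the observed errors are mostly well below 10, a smaller width would likely give a more informative histogram.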
