# Fabric Capacity Metrics Long Term Retention

## Background and Purpose
The Fabric Capacity Metrics App enables Microsoft Fabric Capacity (and Premium Capacity) administrators to track utilization of Fabric Capacities over the most recent 14-day period.

Some scenarios call for a longer history of capacity utilization, for example to understand long-term trends, identify seasonal workload patterns, and plan future capacity.

This notebook is a solution accelerator that imports summaries of Fabric Capacity metrics from the Fabric Capacity Metrics App and stores them in a Fabric Lakehouse for long-term retention and analysis. The data is imported at the granularity of one record per item per operation type per hour.

## Warning!
<mark>This solution is not approved and not supported by Microsoft. The structure and content of the Fabric Capacity Metrics App that serves as the data source for this solution may change without notice, rendering this solution inoperable. Use at your own risk.</mark>

## Installation and Usage Instructions
1. You must be an administrator of one or more Fabric and/or Premium Capacities.
1. Install the [Fabric Capacity Metrics App](https://learn.microsoft.com/en-us/fabric/enterprise/metrics-app-install?tabs=1st).
1. Download and [import this Spark Notebook into a Fabric Workspace](https://learn.microsoft.com/en-us/fabric/data-engineering/how-to-use-notebook#import-existing-notebooks). The workspace must be hosted on a Fabric Capacity.
1. [Connect a new or existing Fabric Lakehouse to your notebook](https://learn.microsoft.com/en-us/fabric/data-engineering/how-to-use-notebook#connect-lakehouses-and-notebooks).
1. If necessary, modify the parameter values in this notebook, such as the name of the Workspace and Semantic Model associated with the Fabric Capacity Metrics App that will serve as the source of your data.
1. Execute the notebook for the first time, which will create the following tables in your Lakehouse: 
    - **hourly_capacity_metrics** stores summaries of capacity operations at the granularity of one record per item per operation type per hour.
    - **items** stores a catalog of items that provides context for the metrics stored in the hourly_capacity_metrics table.
    - **processing_log** is a helper table that stores one record for each successfully completed data load operation. This table facilitates efficient incremental retrieval of historical metrics.
1. Finally, schedule the Notebook to run on a recurring schedule (such as daily). Note: to avoid gaps in historical data, the notebook must be executed at least once per 14-day period (preferably at least weekly).
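
The scheduling constraint in the last step can be checked with simple date arithmetic. The following is a minimal sketch, not part of the notebook: the helper name `days_until_gap` is hypothetical, and the exact boundary of the 14-day window is an assumption based on the description above.

```python
from datetime import date, timedelta

# Assumption: the Fabric Capacity Metrics App exposes roughly the most recent 14 days.
RETENTION_DAYS = 14

def days_until_gap(last_load: date, today: date, retention_days: int = RETENTION_DAYS) -> int:
    """Return how many days remain before data recorded after `last_load` ages out.

    A negative result suggests that some history is probably no longer retrievable.
    """
    oldest_available = today - timedelta(days=retention_days)
    return (last_load - oldest_available).days

# Hypothetical dates, for illustration only
print(days_until_gap(date(2024, 3, 1), date(2024, 3, 8)))   # 7
print(days_until_gap(date(2024, 2, 1), date(2024, 3, 8)))   # -22
```

A weekly schedule keeps this value comfortably positive, which is why a weekly or daily cadence is preferable to running once every 14 days.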

## Usage Notes
- This solution imports summaries of data (rather than a detailed log of individual operations). The granularity of this data may or may not be sufficient for your needs.
- This solution uses the identity context of the person running this notebook (or of the notebook owner in a scheduled/unattended execution scenario).
- This solution retrieves data only from capacities on which that user is an administrator.
- By default, this solution will merge all data into delta tables in the connected Lakehouse and will also store raw data files in parquet format in the Files area of the Lakehouse. If desired, you may disable the storage of raw parquet files by setting the *store_raw_data* parameter to *False*.
- To improve efficiency, this solution filters out summary records where CU consumption, total duration and total operation counts are all 0.
- To analyze the data, you may use Spark-based notebooks, query the data using the built-in SQL Analytics Endpoints, create a semantic model and corresponding reports, or use any other suitable analytical tools.

In [1]:
#Import required libraries
import pandas as pd
import sempy.fabric as fabric
from datetime import date, datetime, timedelta
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, TimestampNTZType
from pyspark.sql.functions import current_timestamp, input_file_name, date_format
from pyspark.sql import functions as F

spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.binSize", "1073741824")
spark.conf.set("spark.sql.catalog.pbi", "com.microsoft.azure.synapse.ml.powerbi.PowerBICatalog")

In [2]:
#Define parameters
source_workspace = 'Microsoft Fabric Capacity Metrics'
source_dataset = 'Fabric Capacity Metrics'

#Specify whether to store raw data in the Files area of the Lakehouse (default is True)
store_raw_data = True
path_raw = 'Files/Raw Capacity Metrics History/'

In [None]:
%%sql
-- Create a table to store the processing log (if it does not already exist)
CREATE TABLE IF NOT EXISTS processing_log AS 
SELECT current_timestamp() AS Timestamp, CAST('2000-01-01' AS date) AS Watermark;

In [None]:
#Get the latest watermark from the processing_log
query = """ SELECT MAX(Watermark) AS Watermark FROM processing_log"""
df = spark.sql(query)
watermark = df.first()['Watermark']
start_time = datetime.now()
start_date = date.today()

#Parse date parts
start_year = str(start_time.year)
start_month = str(start_time.month)
start_day = str(start_time.day)
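
A watermark date such as the one read from `processing_log` can be turned into a DAX `DATE(...)` expression for filtering. The sketch below is illustrative only: the helper names `dax_date_literal` and `build_date_filter` are hypothetical and do not appear in the notebook.

```python
from datetime import date

def dax_date_literal(d: date) -> str:
    """Render a Python date as a DAX DATE(year, month, day) expression."""
    return f"DATE({d.year}, {d.month}, {d.day})"

def build_date_filter(table: str, column: str, since: date) -> str:
    """Build a DAX predicate that selects rows on or after `since`."""
    return f"'{table}'[{column}] >= {dax_date_literal(since)}"

# Example using the initial watermark value seeded into processing_log
print(build_date_filter("MetricsByItemandOperationandHour", "DateTime", date(2000, 1, 1)))
# 'MetricsByItemandOperationandHour'[DateTime] >= DATE(2000, 1, 1)
```

Building the expression from integer date parts avoids any dependence on locale-specific date string parsing in DAX.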

## Extract and Store Item Catalog

In [None]:
#Extract the catalog of items from the Fabric Capacity Metrics App semantic model and store as parquet
dax_query = """
    EVALUATE
    SUMMARIZECOLUMNS (
        Items[Billable type],
        Items[capacityId],
        Items[dcount_Identity],
        Items[IsVirtualArtifactName],
        Items[IsVirtualArtifactStatus],
        Items[IsVirtualWorkspaceName],
        Items[IsVirtualWorkspaceStatus],
        Items[ItemId],
        Items[ItemKey],
        Items[ItemKind],
        Items[ItemName],
        Items[Timestamp],
        Items[UniqueKey],
        Items[WorkspaceId],
        Items[WorkspaceName]
    )"""

df_items = fabric.evaluate_dax(
    workspace = source_workspace,
    dataset = source_dataset,
    dax_string = dax_query,
    verbose = 1
    )

#Convert Fabric DataFrames into Spark DataFrames
dfs_items = spark.createDataFrame(df_items)
#Clean up column names
dfs_items = dfs_items.select([F.col(x).alias(x.replace('Items[','').replace(' ', '_').replace('[', '').replace(']', '').replace('(', '').replace(')','')) for x in dfs_items.columns])
#Create a temporary view on top of the data frame
dfs_items.createOrReplaceTempView('current_items')

if store_raw_data:
    #Set path and file name
    subfolder = 'Items/' + start_year + '/' + start_month + '/' + start_day + '/'
    file_name = start_time.strftime("%H-%M-%S")

    #Save DataFrames to OneLake
    dfs_items.write.mode("overwrite").format("parquet").save(path_raw + subfolder + file_name)

#display(df_items)
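
The chained `replace` calls used above to clean up column names can also be expressed as one reusable function. This is a sketch under the assumption, implied by the replace chain, that column names arrive in the form `Table[Column name]`; the function name `clean_column_name` is hypothetical.

```python
import re

def clean_column_name(name: str) -> str:
    """Drop a leading table qualifier and brackets, then replace spaces with underscores."""
    # Matches "Table[Column]", "'Quoted Table'[Column]" and "[Column]"
    match = re.fullmatch(r"'?[^\[\]']*'?\[(.*)\]", name)
    column = match.group(1) if match else name
    # Remove any remaining brackets and parentheses, as the notebook does
    column = re.sub(r"[()\[\]]", "", column)
    return column.replace(" ", "_")

print(clean_column_name("Items[Billable type]"))                                # Billable_type
print(clean_column_name("MetricsByItemandOperationandHour[Throttling (min)]"))  # Throttling_min
```

In the notebook, such a function could be applied as `F.col(x).alias(clean_column_name(x))` for each column, which keeps the cleanup logic identical for both the items and the metrics DataFrames.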

In [None]:
%%sql
-- Create a table to store items catalog (if it does not already exist)
CREATE TABLE IF NOT EXISTS items AS 
SELECT * FROM current_items LIMIT 0; 

-- Merge currently retrieved items into the permanent items table
MERGE INTO items AS target
USING current_items AS source
ON source.UniqueKey = target.UniqueKey
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;

## Extract and Store Hourly Capacity Metrics

In [None]:
#Extract the hourly metrics by item by operation from the Fabric Capacity Metrics App semantic model and store as parquet
#The data is pre-filtered from start date forward
dax_query = """
EVALUATE
    SUMMARIZECOLUMNS (
        'MetricsByItemandOperationandHour'[UniqueKey],
        'MetricsByItemandOperationandHour'[ItemId],
        'MetricsByItemandOperationandHour'[WorkspaceId],
        'MetricsByItemandOperationandHour'[PremiumCapacityId],
        'MetricsByItemandOperationandHour'[DateTime],
        'MetricsByItemandOperationandHour'[OperationName],
        'MetricsByItemandOperationandHour'[sum_CU],
        'MetricsByItemandOperationandHour'[sum_duration],
        'MetricsByItemandOperationandHour'[Throttling (min)],
        'MetricsByItemandOperationandHour'[count_users],
        'MetricsByItemandOperationandHour'[count_successful_operations],
        'MetricsByItemandOperationandHour'[count_rejected_operations],
        'MetricsByItemandOperationandHour'[count_operations],
        'MetricsByItemandOperationandHour'[count_Invalid_operations],
        'MetricsByItemandOperationandHour'[count_InProgress_operations],
        'MetricsByItemandOperationandHour'[count_failure_operations],
        'MetricsByItemandOperationandHour'[count_cancelled_operations],
        'MetricsByItemandOperationandHour'[avg_DurationMS],
        'MetricsByItemandOperationandHour'[percentile_DurationMs_50],
        'MetricsByItemandOperationandHour'[percentile_DurationMs_90],
        FILTER ('MetricsByItemandOperationandHour', 
            'MetricsByItemandOperationandHour'[DateTime] >= DATE(""" + str(watermark.year) + ", " + str(watermark.month) + ", " + str(watermark.day) + """)
            && ('MetricsByItemandOperationandHour'[sum_CU] > 0 
                || 'MetricsByItemandOperationandHour'[sum_duration] > 0 
                || 'MetricsByItemandOperationandHour'[count_operations] > 0)
            ) 
        )
    """

df_metrics = fabric.evaluate_dax(
    workspace = source_workspace,
    dataset = source_dataset,
    dax_string = dax_query,
    verbose = 1
    )
         
#Convert Fabric DataFrames into Spark DataFrames
dfs_metrics = spark.createDataFrame(df_metrics)
#Clean up column names
dfs_metrics = dfs_metrics.select([F.col(x).alias(x.replace(' ', '_').replace('[', '').replace(']', '').replace('(', '').replace(')','').replace('MetricsByItemandOperationandHour','')) for x in dfs_metrics.columns])
#Create a temporary view on top of the data frame
dfs_metrics.createOrReplaceTempView('current_metrics')

if store_raw_data:
    #Set path and file name
    subfolder = 'Hourly Metrics/' + start_year + '/' + start_month + '/' + start_day + '/'
    file_name = start_time.strftime("%H-%M-%S")
    #Save DataFrames to OneLake
    dfs_metrics.write.mode("overwrite").format("parquet").save(path_raw + subfolder + file_name)

#display(dfs_metrics)

In [None]:
%%sql
-- Create a table to store hourly capacity metrics (if it does not already exist)
CREATE TABLE IF NOT EXISTS hourly_capacity_metrics AS 
SELECT * FROM current_metrics LIMIT 0; 

-- Merge currently retrieved metrics into the permanent table
MERGE INTO hourly_capacity_metrics AS target
USING current_metrics AS source
ON source.UniqueKey = target.UniqueKey
    AND source.DateTime = target.DateTime
    AND source.OperationName = target.OperationName
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
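
The effect of the `MERGE` above is an upsert keyed on `UniqueKey`, `DateTime` and `OperationName`, so reloading an overlapping time window should not create duplicates. The following pure-Python sketch mimics that behavior with a dictionary. It is a simplification rather than how Delta Lake implements `MERGE`, the function name `merge_rows` is hypothetical, and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, str]  # (UniqueKey, DateTime, OperationName)

def merge_rows(target: Dict[Key, dict], source: Iterable[dict]) -> Dict[Key, dict]:
    """Upsert source rows into target: update on key match, insert otherwise."""
    for row in source:
        key = (row["UniqueKey"], row["DateTime"], row["OperationName"])
        target[key] = dict(row)
    return target

table: Dict[Key, dict] = {}
batch = [
    {"UniqueKey": "a", "DateTime": "2024-03-01T10:00", "OperationName": "Query", "sum_CU": 1.5},
    {"UniqueKey": "a", "DateTime": "2024-03-01T11:00", "OperationName": "Query", "sum_CU": 2.0},
]
merge_rows(table, batch)
merge_rows(table, batch)  # loading the same window again adds no rows
print(len(table))  # 2
```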

## Log Completion

In [None]:
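#Add a record to the processing log to indicate completion of a data loading operation
#Timestamp records when this run started; Watermark records the date of this run
query = "INSERT INTO processing_log (Timestamp, Watermark) SELECT CAST('" + format(start_time) + "' AS timestamp), CAST('" + format(start_date) + "' AS date)"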
spark.sql(query)

## Analyze Results

The following sample query displays the top 100 items by resource consumption.

In [None]:
%%script echo "Skipping this cell, since it is provided only for illustration purposes and is not needed for data extraction and storage. To execute the sample query shown below, please delete this line."
%%sql

SELECT I.capacityId, I.WorkspaceName, I.ItemKind, I.ItemName, ROUND(SUM(M.sum_CU),2) AS CapacityUnitSecondsConsumed
FROM items I 
    INNER JOIN hourly_capacity_metrics M 
        ON I.UniqueKey = M.UniqueKey
GROUP BY I.capacityId, I.WorkspaceName, I.ItemKind, I.ItemName
ORDER BY CapacityUnitSecondsConsumed DESC
LIMIT 100