# Analyze OneLake data volumes in Microsoft Fabric

### Purpose and Overview
This notebook helps Microsoft Fabric workspace administrators manage and track data stored in OneLake.
* Recursively traverses folders and files in a given Workspace or folder path.
* Returns a list of all files within the path, broken out by path segment and enriched with relevant attributes, such as:
    - File Extension
    - Modification Time
    - Size
    - Etc.
* Provides a few basic summaries of the collected data: total for the entire path, breakdown by Fabric item, by file extension, etc.
* Stores a snapshot of collected data (at the granularity of one record per file) into a Lakehouse table for historical record keeping and subsequent analysis.
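
As a rough illustration of the per-file enrichment listed above, the sketch below splits one path into segments and derives an extension, a size in MB, and a depth. The `describe_file` helper, the sample path, and the sample size are assumptions made up for this example; the notebook itself performs the equivalent steps with pandas on the listing returned from OneLake.

```python
def describe_file(path: str, size_bytes: int) -> dict:
    """Break one file path into segments and derive a few attributes (illustrative only)."""
    trimmed = path.replace("abfss://", "", 1)
    segments = trimmed.split("/")
    name = segments[-1]
    # Treat everything after the first dot as the extension; empty string when there is no dot
    _, _, extension = name.partition(".")
    record = {f"Path_{i}": segment for i, segment in enumerate(segments)}
    record.update(
        FullPath=trimmed,
        FileExtension=extension,
        SizeInMB=size_bytes / 2**20,
        Depth=trimmed.count("/"),
    )
    return record

# Hypothetical path: "workspace-id" and "item-id" are placeholders, not real identifiers
example = describe_file(
    "abfss://workspace-id@onelake.dfs.fabric.microsoft.com/item-id/Files/raw/data.csv",
    3 * 2**20,
)
print(example)
```

In this sketch `Path_0` holds the workspace endpoint and `Path_1` the Fabric item, which is why the later summaries group by those segments.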

### Usage Instructions
This notebook is intended to be executed in a Microsoft Fabric workspace hosted on a Fabric capacity. The user executing this notebook must have access to the workspace and items whose storage usage is being analyzed (if the current user does not have permissions to a path segment, that segment will be skipped). This notebook can be used to analyze OneLake storage for any workspace in your Fabric tenant to which you have sufficient permissions (e.g., this notebook may reside in workspace A while analyzing storage in workspaces B, C, or D).

1. Download and import this Spark Notebook into a Fabric Workspace.
1. Attach a Default Lakehouse to this notebook. This Lakehouse will be used to store snapshots of data collected by this notebook. This Lakehouse may reside in any Fabric workspace (which may be different from the workspace where this Notebook is located or the workspace whose storage is being analyzed). The person executing this notebook must have full access to this Lakehouse.
1. Specify values for the *workspace_name*, *folder_path*, and *snapshot_table_name* parameters in a code cell below.
1. Press the "Run all" button to run all code cells.
1. Review the summaries provided within the notebook.
1. If desired, analyze data from the Lakehouse table where snapshots of collected data have been stored. Note that the Lakehouse table includes *RootPath* and *SnapshotTimestamp* columns that can be used to select data for a specific snapshot.

If necessary, repeat the process for additional Workspaces or folder paths.
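
The sketch below shows one way to pick out a single snapshot using the *RootPath* and *SnapshotTimestamp* columns. It uses plain Python on made-up rows so the filter logic is easy to follow; the `latest_snapshot` helper and all sample values are assumptions for illustration. Against the real Lakehouse table, the same filter would typically be expressed in Spark or SQL.

```python
from datetime import datetime

ROOT = "abfss://ws-a@onelake.dfs.fabric.microsoft.com/"  # hypothetical root path

# Made-up rows standing in for the snapshot table (one record per file per snapshot)
rows = [
    {"RootPath": ROOT, "SnapshotTimestamp": datetime(2024, 5, 1, 8, 0), "FullPath": "a.csv", "SizeInMB": 10.0},
    {"RootPath": ROOT, "SnapshotTimestamp": datetime(2024, 5, 2, 8, 0), "FullPath": "a.csv", "SizeInMB": 10.0},
    {"RootPath": ROOT, "SnapshotTimestamp": datetime(2024, 5, 2, 8, 0), "FullPath": "b.csv", "SizeInMB": 15.0},
    {"RootPath": "abfss://ws-b@onelake.dfs.fabric.microsoft.com/", "SnapshotTimestamp": datetime(2024, 5, 3, 8, 0), "FullPath": "c.csv", "SizeInMB": 99.0},
]

def latest_snapshot(rows, root_path):
    """Return only the rows from the most recent snapshot taken for root_path."""
    candidates = [row for row in rows if row["RootPath"] == root_path]
    if not candidates:
        return []
    newest = max(row["SnapshotTimestamp"] for row in candidates)
    return [row for row in candidates if row["SnapshotTimestamp"] == newest]

latest = latest_snapshot(rows, ROOT)
total_mb = sum(row["SizeInMB"] for row in latest)
print(len(latest), total_mb)
```

Filtering on both columns matters because the table is appended to on every run, so summing it without a filter would count the same file once per snapshot.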

### Considerations
Collecting a list of files stored in OneLake can be a lengthy process for folder paths with large numbers of files. The duration of the process is driven by the number of files, not by their size. This notebook will typically process from hundreds of thousands to a few million files per hour. Due to the way in which storage data collection is parallelized, throughput will be higher when each level of the folder hierarchy contains many "balanced" folders (with each folder containing similar numbers of files). Throughput will be lower when most files at any level of the folder hierarchy are stored in one (or only a few) large folders.
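
One way to read the balance effect: folders are listed level by level, each folder is listed by a single worker, and a level is finished only when its largest folder has been listed. The sketch below mimics that pattern on an in-memory tree of nested dictionaries (a dict is a folder, any other value is a file). The function name and both sample trees are assumptions for illustration and do not call OneLake.

```python
from concurrent.futures import ThreadPoolExecutor

def largest_listing_per_round(tree: dict, max_workers: int = 16) -> list:
    """Walk a nested dict level by level and report the biggest single listing in each round."""
    current = [tree]
    largest = []
    while current:
        # Each folder in the current level is listed by one worker
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            listings = list(executor.map(lambda folder: list(folder.values()), current))
        largest.append(max(len(listing) for listing in listings))
        # Sub-folders form the next level; everything else is a file
        current = [child for listing in listings for child in listing if isinstance(child, dict)]
    return largest

# Both trees hold 16 files in 4 folders, but they are distributed differently
balanced = {f"folder{i}": {f"file{j}": 1 for j in range(4)} for i in range(4)}
skewed = {
    "big": {f"file{j}": 1 for j in range(13)},
    "a": {"file": 1},
    "b": {"file": 1},
    "c": {"file": 1},
}

print(largest_listing_per_round(balanced))
print(largest_listing_per_round(skewed))
```

In the skewed tree one worker has to return 13 entries while the other three return one each, so most of the available parallelism sits idle during that round.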

### Important Notice!
<mark>Data volumes returned by this notebook **will not match data volumes reported by the Fabric Capacity Metrics App or that may appear on your Microsoft Azure bill for your Fabric Capacity** due to substantial differences in methodologies for calculating data volumes.</mark>
The following are a few factors that may result in different data volume calculations:
* This notebook will include files available via shortcuts, even if these files do not physically reside in this OneLake path (and are merely logically referenced within this path). 
* This notebook may include storage that is considered non-billable (e.g., mirroring storage).  
* This notebook will skip any paths to which the user running this notebook does not have permissions. 
* This notebook will not include files that have been deleted, but are being temporarily retained to provide protection from accidental deletion (soft-deleted data in OneLake is billed at the same rate as active data).
* This notebook will not capture data for OneLake Cache storage (used by certain items, such as Eventhouses).
* The timing of when your storage is calculated will differ from the timing used to calculate billable storage.

Therefore, while this notebook can serve as a helpful tool in analyzing your storage, *it should not be used to estimate your billable storage*!


In [None]:
#Specify Fabric workspace name (or leave blank to collect data for the current workspace)
workspace_name = ""

#Specify folder path within the workspace (or leave blank to collect data for all items in the workspace)
folder_path = ""

#Store a snapshot of the list of files into the following table in the Default Lakehouse attached to this Notebook
snapshot_table_name = "storage_snapshot"

In [None]:
#Import relevant libraries
import sempy.fabric as fabric
import time

#Make sure that the current notebook is attached to a default Lakehouse
lh = fabric.get_lakehouse_id()
if not lh:
    raise Exception('The notebook is not attached to a default lakehouse. Please attach a default lakehouse and try again.')

#Resolve workspace Id
workspace_id = fabric.resolve_workspace_id(None if workspace_name == "" else workspace_name)
if not workspace_id:
    raise Exception('Workspace name is not valid. Please specify a valid workspace name.')

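#Build the OneLake path; collapse the double slash that appears when folder_path is blank or starts with a slash
path = "abfss://" + (workspace_id + "@onelake.dfs.fabric.microsoft.com/" + folder_path + "/").replace("//", "/")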

start_time = time.time()
print("Processing the following path: " + path)

In [None]:
import pandas as pd
from notebookutils import mssparkutils
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import functions as F

def get_descendants(path):
    # Recursively get files in a directory path
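    new_paths = mssparkutils.fs.ls(path)
    # Files sitting directly in the root path are collected here; the loop below only descends into folders
    files = [p for p in new_paths if not p.isDir]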

    def get_children(path):
        try:
            print(pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S") + " + Processed: " + path.path)
            return mssparkutils.fs.ls(path.path)
        except Exception as e:
            print(pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S") + " ! Error: " + path.path + f". Skipping this path! Error message: {e}")
            return [None]

    while len(new_paths) > 0:
        current_paths = []
        for path in new_paths:
            if path.isDir:
                current_paths.append(path)
        
        new_paths = []

        #Parallelize retrieval of child items from each path
        with ThreadPoolExecutor(max_workers=16) as executor:
            results = executor.map(get_children, current_paths)
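            # Flatten the per-folder listings into a single list of children
            children = [child for listing in results for child in listing]
            for child in children:
                if child is None: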
                    print(pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S") + ' * Ignoring skipped path!')
                elif child.isDir:
                    new_paths.append(child)
                else:
                    files.append(child)
        print(pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S") + f" * Files processed : {len(files)}")
    return files

def file_list_to_dataframe(files):
    if len(files) > 0:
        # Convert file list into a Pandas data frame and enrich the data to facilitate analysis
        schema = ['path','name','size', 'modifyTime']
        df = pd.DataFrame([[getattr(file, attribute) for attribute in schema] for file in files], columns = schema).sort_values('path')
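        # Extension = everything after the first dot in the file name (e.g. "snappy.parquet"); NaN when the name has no dot
        df['FileExtension'] = df['name'].str.split(pat='.', expand=False, n=1).str[1]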
        df['path'] = df['path'].str.replace("abfss://", "", case=False, regex=False)
        loc_0 = df.columns.get_loc('path')
        df_split = df['path'].str.split(pat='/', expand=True).add_prefix('Path_')
        df = pd.concat([df.iloc[:, :loc_0], df_split, df.iloc[:, loc_0:]], axis=1)
        df['SizeInMB'] = df['size']/(pow(2, 20))
        df['modifyTime'] = pd.to_datetime(df['modifyTime'], unit = 'ms')
        df = df.drop(columns=['size'])
        df = df.rename(columns={'path': 'FullPath'})
        df['Depth'] = df['FullPath'].str.count('/')
        return df
    else:
        print('No files discovered.')
        return pd.DataFrame([])
        

#Get all files from path
df = file_list_to_dataframe(get_descendants(path))

#Write data to a table
if len(df) > 0:
    spark_df = spark.createDataFrame(df)
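    # Use a full timestamp (not just a date) so that several snapshots taken on the same day stay distinguishable
    spark_df = spark_df.withColumn("RootPath", F.lit(path)).withColumn('SnapshotTimestamp', F.current_timestamp())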
    spark_df.write.option("mergeSchema", "true").mode("append").format("delta").saveAsTable(snapshot_table_name)
    print(pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S") + f" * Stored a snapshot of the file list into the {snapshot_table_name} table.")
else:
    raise Exception("No files discovered. Please provide a Workspace and Path containing one or more files.")

In [None]:
#Summarize data for the entire path
dfSummary = df.groupby(['Path_0']).agg(SizeInMB=('SizeInMB', 'sum'), FileCount=('Path_0', 'count'), MeanFileSize=('SizeInMB', 'mean'), MedianFileSize=('SizeInMB', 'median'), MaxFileSize=('SizeInMB', 'max'), LatestModification=('modifyTime', 'max'), MaxDepth=('Depth', 'max')).reset_index()
display(dfSummary)

In [None]:
#Summarize data by Fabric Item
dfSummary = df.groupby(['Path_1']).agg(SizeInMB=('SizeInMB', 'sum'), FileCount=('Path_0', 'count'), MeanFileSize=('SizeInMB', 'mean'), MedianFileSize=('SizeInMB', 'median'), MaxFileSize=('SizeInMB', 'max'), LatestModification=('modifyTime', 'max'), MaxDepth=('Depth', 'max')).reset_index()
display(dfSummary)

In [None]:
#Summarize data by File Extension
dfSummary = df.groupby(['FileExtension']).agg(SizeInMB=('SizeInMB', 'sum'), FileCount=('Path_0', 'count'), MeanFileSize=('SizeInMB', 'mean'), MedianFileSize=('SizeInMB', 'median'), MaxFileSize=('SizeInMB', 'max'), LatestModification=('modifyTime', 'max'), MaxDepth=('Depth', 'max')).reset_index()
display(dfSummary)

In [None]:
#Summarize data for the first 3 path segments
dfSummary = df.groupby(['Path_1', 'Path_2', 'Path_3']).agg(SizeInMB=('SizeInMB', 'sum'), FileCount=('Path_0', 'count'), MeanFileSize=('SizeInMB', 'mean'), MedianFileSize=('SizeInMB', 'median'), MaxFileSize=('SizeInMB', 'max'), LatestModification=('modifyTime', 'max'), MaxDepth=('Depth', 'max')).reset_index()
display(dfSummary)

In [None]:
#Return the full list of files
display(df)