# Analyze OneLake Storage Usage in a Fabric Workspace

### Overview
* Recursively traverse folders and files in a given Workspace or OneLake path.
* Return a list of all files within the path, broken out by path segment, and enrich the data with relevant attributes, such as:
    - File Extension
    - Modification Time
    - Size
    - Etc.
* Provide a few basic summaries of the collected data: total for the entire path, breakdown by file extension, by Fabric Item, etc.
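
To illustrate the attributes listed above, here is a minimal plain-Python sketch of the enrichment applied to a single file. The function name `describe_file` and the sample path are hypothetical; the notebook itself performs the same steps with pandas on the whole file list. As in the notebook, the extension is taken to be everything after the first `.` in the file name, and sizes are converted using 1 MB = 2^20 bytes.

```python
from datetime import datetime, timezone


def describe_file(full_path, size_bytes, modify_time_ms):
    """Derive path segments, extension, size in MB, and depth for one file.

    Hypothetical helper for illustration only.
    """
    # Drop the scheme so the first segment is the workspace authority
    trimmed = full_path.replace("abfss://", "", 1)
    segments = trimmed.split("/")
    name = segments[-1]
    # Everything after the first '.' (so 'a.snappy.parquet' -> 'snappy.parquet')
    extension = name.split(".", 1)[1] if "." in name else None
    return {
        "segments": segments,
        "extension": extension,
        "size_mb": size_bytes / 2**20,
        # Modification time is assumed to be epoch milliseconds
        "modified": datetime.fromtimestamp(modify_time_ms / 1000, tz=timezone.utc),
        "depth": trimmed.count("/"),
    }


info = describe_file(
    "abfss://Sales@onelake.dfs.fabric.microsoft.com/Orders.Lakehouse/Files/raw/orders.csv",
    3 * 2**20,
    1700000000000,
)
print(info["segments"][1], info["extension"], info["size_mb"], info["depth"])
```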

### Usage Instructions
This notebook is intended to be executed in a Microsoft Fabric workspace. The person executing this notebook must have access to workspaces and items whose storage usage is being analyzed.
1. Download and import this Spark Notebook into a Fabric Workspace. The workspace must be hosted on a Fabric (or Premium) Capacity. 
1. Specify values for the *workspace_name* and *folder_path* parameters in the first code cell below.
1. Run all code cells
1. Review the results
1. If desired, you may extend this notebook to store the list of files to a repository of your choice (e.g., a Fabric Lakehouse) to maintain a historical record of your storage usage.
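
One possible way to keep a historical record, as suggested in the last step, is to write each run's file list to a timestamped file. The sketch below is only an illustration: the function name `save_snapshot`, the `SnapshotTimeUtc` column, and the file naming pattern are all hypothetical, and the target directory is whatever location you choose (for example, a folder under an attached Lakehouse).

```python
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd


def save_snapshot(df, target_dir, snapshot_time=None):
    """Write the file list to a timestamped CSV and return the file path.

    Hypothetical helper: names and layout are assumptions, not part of the notebook.
    """
    snapshot_time = snapshot_time or datetime.now(timezone.utc)
    out = df.copy()
    # Stamp every row so snapshots from different runs can be combined later
    out["SnapshotTimeUtc"] = snapshot_time.strftime("%Y-%m-%dT%H:%M:%SZ")
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    file_name = "storage_usage_" + snapshot_time.strftime("%Y%m%dT%H%M%SZ") + ".csv"
    file_path = target / file_name
    out.to_csv(file_path, index=False)
    return file_path
```

Calling `save_snapshot(df, some_directory)` after the file list has been collected would add one file per run, which can later be read back and compared across dates.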

### Limitations
This notebook will attempt to use "user-friendly" workspace names and item names. However, if user-friendly names contain characters that cannot be safely used in URLs, item identifiers will be replaced by globally unique identifiers (GUIDs). Also note that Fabric Item Types cannot be properly detected for items identified using GUIDs.
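
The following sketch shows the two checks this limitation refers to, using only the standard library. The helper names `is_url_safe` and `detect_item_type` are hypothetical, and the GUID is a made-up placeholder; the notebook applies the same `urllib.parse.quote_plus` comparison to the workspace name and the same "text after the last dot" rule to the item segment.

```python
import urllib.parse


def is_url_safe(name):
    """True if URL-encoding leaves the name unchanged (hypothetical helper)."""
    return urllib.parse.quote_plus(name) == name


def detect_item_type(item_segment):
    """Return the text after the last '.' in an item segment (hypothetical helper).

    For 'Orders.Lakehouse' this yields 'Lakehouse'. A GUID contains no '.',
    so the GUID itself is returned, which is why item types cannot be
    detected for GUID-identified items.
    """
    return item_segment.rsplit(".", 1)[-1]


print(is_url_safe("Sales_2024"))         # unchanged by encoding
print(is_url_safe("Sales & Marketing"))  # spaces and '&' get encoded
print(detect_item_type("Orders.Lakehouse"))
print(detect_item_type("11111111-2222-3333-4444-555555555555"))
```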

### Important Notice!
<mark>Data volumes returned by this notebook **will not match data volumes reported by the Fabric Capacity Metrics App** due to substantial differences in methodologies for calculating data volumes. For example, this notebook will include files available via shortcuts, even if these files do not physically reside in this OneLake path (and are merely logically referenced within this path). Similarly, this notebook may include storage that is considered non-billable (e.g., mirroring storage). Furthermore, this notebook will not include files that have been deleted, but are being temporarily retained to provide protection from accidental deletion (soft-deleted data in OneLake is billed at the same rate as active data). 
Therefore, while this notebook can serve as a helpful tool in analyzing your storage, *it should not be used to estimate billable storage.*</mark>


In [None]:
#Specify Fabric workspace name (or leave blank to collect data for the current workspace)
workspace_name = ""

#Specify folder path within the workspace (or leave blank to collect data for all items in the workspace)
folder_path = "" 

In [None]:
#Import relevant libraries
import sempy.fabric as fabric
import urllib.parse

#Resolve workspace name
workspace_name = fabric.resolve_workspace_name(None if workspace_name == "" else workspace_name)
#URL encode workspace name
workspace_name_quoted = urllib.parse.quote_plus(workspace_name)

#If the workspace name is the same as its encoded name, use the workspace name, otherwise use the GUID:
if workspace_name == workspace_name_quoted:
    workspace_id = workspace_name
else:
    workspace_id = fabric.resolve_workspace_id(workspace_name)

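#Build the OneLake path, collapsing the doubled slash left behind when folder_path is blank or starts/ends with "/"
path = "abfss://" + f"{workspace_id}@onelake.dfs.fabric.microsoft.com/{folder_path}/".replace("//", "/")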

print("Processing the following path: " + path)
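
For reference, the path construction above can be expressed as a small standalone function and tried outside Fabric. This is a sketch rather than the notebook's exact logic: `build_onelake_path` is a hypothetical name, and it trims leading and trailing slashes from the folder path instead of replacing `//`, which gives the same result for typical inputs.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def build_onelake_path(workspace_id, folder_path=""):
    """Return an abfss:// path ending in '/' for a workspace and optional folder.

    Hypothetical helper for illustration only.
    """
    folder = folder_path.strip("/")
    base = f"abfss://{workspace_id}@{ONELAKE_HOST}/"
    # Append the folder only when one was given, keeping a single trailing slash
    return base + (folder + "/" if folder else "")


print(build_onelake_path("Sales"))
print(build_onelake_path("Sales", "/Orders.Lakehouse/Files/"))
```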

In [None]:
import pandas as pd
from notebookutils import mssparkutils
from concurrent.futures import ThreadPoolExecutor

def get_descendants(path):
    # Recursively get files in a directory path, one directory level at a time
    files = []
    current_paths = [path]

    while len(current_paths) > 0:
        #Parallelize retrieval of child items from each directory at the current level
        with ThreadPoolExecutor(max_workers=16) as executor:
            results = executor.map(mssparkutils.fs.ls, current_paths)
            children = [child for listing in results for child in listing]

        #Directories are listed in the next pass; files are collected
        current_paths = [child.path for child in children if child.isDir]
        files.extend(child for child in children if not child.isDir)
    return files

def file_list_to_dataframe(files):
    # Convert file list into a Pandas data frame and enrich the data to facilitate analysis
    schema = ['path','name','size', 'modifyTime']
    df = pd.DataFrame([[getattr(file, attribute) for attribute in schema] for file in files], columns = schema).sort_values('path')

    # Replace the file name with its extension (everything after the first '.')
    loc_0 = df.columns.get_loc('name')
    df_split = df['name'].str.split(pat='.', expand=True, n=1).add_prefix('name_')
    df = pd.concat([df.iloc[:, :loc_0], df_split, df.iloc[:, loc_0:]], axis=1)
    df = df.drop(columns=['name'])
    df = df.rename(columns={'name_1': 'FileExtension'})
    df = df.drop(columns=['name_0'])

    # Split the path into one column per segment (Path_0 = workspace, Path_1 = item, ...)
    df['path'] = df['path'].str.replace("abfss://", "", case=False, regex=False)
    loc_0 = df.columns.get_loc('path')
    df_split = df['path'].str.split(pat='/', expand=True).add_prefix('Path_')
    df = pd.concat([df.iloc[:, :loc_0], df_split, df.iloc[:, loc_0:]], axis=1)

    # Convert bytes to MB (1 MB = 2^20 bytes) and epoch milliseconds to timestamps
    df['SizeInMB'] = df['size']/(pow(2, 20))
    df['modifyTime'] = pd.to_datetime(df['modifyTime'], unit = 'ms')
    df = df.drop(columns=['size'])
    df = df.rename(columns={'path': 'FullPath'})

    # Depth = number of '/' separators in the path; ItemType = text after the last '.' in the item segment
    df['Depth'] = df['FullPath'].str.count('/')
    df['ItemType'] = df['Path_1'].str.rsplit(pat='.', expand=False, n=1).str[-1]
    return df

#Get all files from path
df = file_list_to_dataframe(get_descendants(path))   
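
The level-by-level traversal used above can be tried outside Fabric with a local-filesystem analogue. The sketch below is an assumption-laden stand-in, not the notebook's code: it swaps `mssparkutils.fs.ls` for `os.scandir`, and the name `list_files_breadth_first` is hypothetical. The structure is the same: list every directory of the current level in parallel, collect the files, and queue the subdirectories for the next pass.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def list_files_breadth_first(root, max_workers=4):
    """Return all file paths under root, visiting one directory level per pass."""

    def get_children(directory):
        # Materialize the entries so the scandir handle is closed promptly
        with os.scandir(directory) as entries:
            return list(entries)

    files = []
    current_dirs = [root]
    while current_dirs:
        # List every directory of the current level in parallel
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            children = list(chain.from_iterable(executor.map(get_children, current_dirs)))
        # Subdirectories are visited in the next pass; files are collected now
        current_dirs = [c.path for c in children if c.is_dir(follow_symlinks=False)]
        files.extend(c.path for c in children if c.is_file(follow_symlinks=False))
    return sorted(files)
```

Listing is I/O-bound, so threads help even in Python; the number of round trips grows with the depth of the tree rather than with the number of directories.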

In [None]:
#Summarize data for the entire path
dfSummary = df.groupby(['Path_0']).agg(SizeInMB=('SizeInMB', 'sum'), FileCount=('Path_0', 'count'), MeanFileSize=('SizeInMB', 'mean'), MedianFileSize=('SizeInMB', 'median'), MaxFileSize=('SizeInMB', 'max'), LatestModification=('modifyTime', 'max'), MaxDepth=('Depth', 'max')).reset_index()
display(dfSummary)

In [None]:
#Summarize data by Fabric Item Type
#Note: item types cannot be properly detected for items identified by GUIDs (rather than user-friendly names). In this scenario, item GUIDs will appear in the ItemType column.
dfSummary = df.groupby(['ItemType']).agg(SizeInMB=('SizeInMB', 'sum'), FileCount=('Path_0', 'count'), MeanFileSize=('SizeInMB', 'mean'), MedianFileSize=('SizeInMB', 'median'), MaxFileSize=('SizeInMB', 'max'), LatestModification=('modifyTime', 'max'), MaxDepth=('Depth', 'max')).reset_index()
display(dfSummary)

In [None]:
#Summarize data by Fabric Item
dfSummary = df.groupby(['Path_1']).agg(SizeInMB=('SizeInMB', 'sum'), FileCount=('Path_0', 'count'), MeanFileSize=('SizeInMB', 'mean'), MedianFileSize=('SizeInMB', 'median'), MaxFileSize=('SizeInMB', 'max'), LatestModification=('modifyTime', 'max'), MaxDepth=('Depth', 'max')).reset_index()
display(dfSummary)

In [None]:
#Summarize data by File Extension
dfSummary = df.groupby(['FileExtension']).agg(SizeInMB=('SizeInMB', 'sum'), FileCount=('Path_0', 'count'), MeanFileSize=('SizeInMB', 'mean'), MedianFileSize=('SizeInMB', 'median'), MaxFileSize=('SizeInMB', 'max'), LatestModification=('modifyTime', 'max'), MaxDepth=('Depth', 'max')).reset_index()
display(dfSummary)

In [None]:
#Summarize data for the first 3 path segments below the workspace
#dropna=False keeps files that sit fewer than 3 segments deep (their missing segments would otherwise exclude them from the summary)
dfSummary = df.groupby(['Path_1', 'Path_2', 'Path_3'], dropna=False).agg(SizeInMB=('SizeInMB', 'sum'), FileCount=('Path_0', 'count'), MeanFileSize=('SizeInMB', 'mean'), MedianFileSize=('SizeInMB', 'median'), MaxFileSize=('SizeInMB', 'max'), LatestModification=('modifyTime', 'max'), MaxDepth=('Depth', 'max')).reset_index()
display(dfSummary)

In [None]:
#Return the full list of files
display(df)
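
The summary cells above all apply the same aggregation and differ only in the grouping columns. If you extend the notebook, one option is to factor that into a helper. The sketch below assumes the column names produced by `file_list_to_dataframe`; the function name `summarize` and the sample data are hypothetical, and it passes `dropna=False` so that rows with missing grouping values are kept.

```python
import pandas as pd


def summarize(df, by):
    """Aggregate size, count, and recency statistics for the given grouping columns."""
    return (
        df.groupby(by, dropna=False)
        .agg(
            SizeInMB=("SizeInMB", "sum"),
            FileCount=("Path_0", "count"),
            MeanFileSize=("SizeInMB", "mean"),
            MedianFileSize=("SizeInMB", "median"),
            MaxFileSize=("SizeInMB", "max"),
            LatestModification=("modifyTime", "max"),
            MaxDepth=("Depth", "max"),
        )
        .reset_index()
    )


# Small made-up sample with the columns the helper relies on
sample = pd.DataFrame(
    {
        "Path_0": ["ws", "ws", "ws"],
        "FileExtension": ["parquet", "parquet", "csv"],
        "SizeInMB": [1.0, 3.0, 2.0],
        "modifyTime": pd.to_datetime([1700000000000, 1700000600000, 1700001200000], unit="ms"),
        "Depth": [4, 5, 3],
    }
)
by_extension = summarize(sample, ["FileExtension"])
print(by_extension)
```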