# Enhanced Fabric Workspace Scanner v02 - Refactored

## Features:
- **Lakehouse Storage**: Saves all analysis results to dedicated lakehouse tables
- **Enhanced Context**: Additional context columns for Reports, Tables, Relationships, Dataflows
- **Column Usage Analysis**: Detailed column usage analysis with context from measures, relationships, and dependencies
- **🆕 Optimized Code**: Replaces repetitive loops with a single pass over datasets and reusable functions for better maintainability

## Tables Created in Lakehouse:
- `workspace_analysis` - Workspace information with object counts (datasets, reports, dataflows)
- `dataset_analysis` - Datasets with Reports, Tables, Relationships, Dataflows context
- `table_analysis` - Tables with usage context from measures, relationships, dependencies
- `column_usage_analysis` - Columns with detailed usage analysis
- `usage_summary` - One row per dataset-report pair with usage status and total reference count

## Code Improvements:
- Single dataset processing loop with comprehensive data collection
- Reusable functions for dataset analysis
- Cached data structures to avoid redundant API calls
- Clear separation of concerns between data collection and analysis
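
The "collect once, analyze from cache" idea in the list above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `CachedDataset`, `collect_once`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real metadata API so that the number of calls can be counted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CachedDataset:
    """Holds everything fetched for one dataset so later steps never re-fetch."""
    name: str
    tables: List[str] = field(default_factory=list)
    measures: List[Dict[str, str]] = field(default_factory=list)


def collect_once(name: str, fetch: Callable[[str, str], list]) -> CachedDataset:
    """Call the (hypothetical) fetcher once per metadata kind and cache the results."""
    return CachedDataset(
        name=name,
        tables=fetch(name, "tables"),
        measures=fetch(name, "measures"),
    )


def tables_without_measures(ds: CachedDataset) -> List[str]:
    """Pure analysis step: reads only from the cache and makes no API calls."""
    hosts = {m["table"] for m in ds.measures}
    return [t for t in ds.tables if t not in hosts]


# Stand-in for a metadata API; it records every call so they can be counted.
calls: List[tuple] = []


def fake_fetch(dataset: str, kind: str) -> list:
    calls.append((dataset, kind))
    if kind == "tables":
        return ["Sales", "Date", "Scratch"]
    return [{"name": "Total Sales", "table": "Sales"}]


cached = collect_once("Demo Model", fake_fetch)
first = tables_without_measures(cached)
second = tables_without_measures(cached)  # re-running the analysis adds no calls
print(first, len(calls))
```

Because the analysis function only reads the cached object, it can be re-run or extended without touching the API again.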


In [1]:
# Install semantic-link-labs for extended Fabric analytics
%pip install semantic-link-labs

Collecting semantic-link-labs
  Downloading semantic_link_labs-0.12.4-py3-none-any.whl.metadata (27 kB)
Collecting semantic-link-sempy>=0.12.1 (from semantic-link-labs)
  Downloading semantic_link_sempy-0.12.1-py3-none-any.whl.metadata (11 kB)
Collecting anytree (from semantic-link-labs)
  Downloading anytree-2.13.0-py3-none-any.whl.metadata (8.0 kB)
Collecting polib (from semantic-link-labs)
  Downloading polib-1.2.0-py2.py3-none-any.whl.metadata (15 kB)
Collecting jsonpath_ng (from semantic-link-labs)
  Downloading jsonpath_ng-1.7.0-py3-none-any.whl.metadata (18 kB)
Collecting fabric-analytics-sdk==0.0.1 (from fabric-analytics-sdk[online-notebook]==0.0.1->semantic-link-sempy>=0.12.1->semantic-link-labs)
  Downloading fabric_analytics_sdk-0.0.1-py3-none-any.whl.metadata (14 kB)
Collecting azure-keyvault-secrets>=4.7.0 (from semantic-link-sempy>=0.12.1->semantic-link-labs)
  Downloading azure_keyvault_secrets-4.10.0-py3-none-any.whl.metadata (18 kB)
Collecting fabric-analytics-notebook

In [2]:
import pandas as pd
import sempy_labs
import sempy.fabric as fabric
from sempy_labs.report import ReportWrapper
import re
import sempy
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType, StructType, LongType, StructField, FloatType
from pyspark.sql.functions import col
from datetime import datetime
import time
from dataclasses import dataclass
from typing import Dict, List, Optional

# Initialize Spark session
spark = SparkSession.builder.getOrCreate()

print("✅ All imports successful and Spark session initialized")

✅ All imports successful and Spark session initialized


In [16]:
# ============================================================
# UTILITY FUNCTIONS AND DATA STRUCTURES
# ============================================================

@dataclass
class DatasetInfo:
    """Data structure to hold comprehensive dataset information"""
    ds_id: str
    ds_name: str
    ws_id: str
    ws_name: str
    dependencies_df: Optional[pd.DataFrame] = None
    tables_df: Optional[pd.DataFrame] = None
    relationships_df: Optional[pd.DataFrame] = None
    measures_df: Optional[pd.DataFrame] = None
    columns_df: Optional[pd.DataFrame] = None

def sanitize_df_columns(df, extra_columns=False, ws_id=None, ds_id=None, ws_name=None, ds_name=None):
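    """
    Normalize column names so Spark accepts them: trim, lower-case, and replace
    each run of non-word characters (spaces, punctuation) with one underscore.
    Optionally adds workspace/dataset identifier columns. Modifies df in place.
    """
    if df.empty:
        return df

    df.columns = [
        re.sub(r'\W+', "_", name.strip().lower())
        for name in df.columns
    ]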

    if extra_columns:
        df['workspace_id'] = ws_id
        df['dataset_id'] = ds_id
        df['workspace_name'] = ws_name
        df['dataset_name'] = ds_name
        
    return df

def save_to_lakehouse(df, table_name, description=""):
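    """
    Save a pandas DataFrame to a lakehouse table via Spark.
    Adds an 'analysis_date' timestamp column and overwrites any existing table.
    Empty DataFrames are skipped; errors are printed rather than raised.
    """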
    try:
        if df.empty:
            print(f"  ⚠️ Skipping empty DataFrame for table: {table_name}")
            return
            
        # Add analysis timestamp
        df_with_timestamp = df.copy()
        df_with_timestamp['analysis_date'] = datetime.now()
        
        # Convert to Spark DataFrame and save
        spark_df = spark.createDataFrame(df_with_timestamp)
        spark_df.write.mode("overwrite").saveAsTable(table_name)
        
        print(f"  ✅ Saved {len(df)} records to '{table_name}' table")
        if description:
            print(f"     📝 {description}")
            
    except Exception as e:
        print(f"  ❌ Error saving to {table_name}: {str(e)}")

def collect_dataset_info(ds_id: str, ds_name: str, ws_id: str, ws_name: str) -> DatasetInfo:
    """
    🆕 Centralized function to collect all dataset-related information in one go
    🆕 Improved: Individual error handling for each API call to prevent blocking
    """
    print(f"🔹 Processing dataset: {ds_name} (Workspace: {ws_name})")
    
    dataset_info = DatasetInfo(ds_id, ds_name, ws_id, ws_name)
    
    # Get model dependencies in their own try/except so a failure does not block the other calls
    try:
        deps = fabric.get_model_calc_dependencies(dataset=ds_id, workspace=ws_id)
        with deps as calc_deps:
            dependencies_df = getattr(calc_deps, "dependencies_df", None)
        
        if dependencies_df is not None and not dependencies_df.empty:
            dependencies_df = sanitize_df_columns(
                df=dependencies_df,
                extra_columns=True,
                ws_id=ws_id,
                ds_id=ds_id,
                ws_name=ws_name,
                ds_name=ds_name
            )
            dataset_info.dependencies_df = dependencies_df
            print(f"  Found {len(dependencies_df)} dependencies")
        else:
            dataset_info.dependencies_df = pd.DataFrame()
            print(f"  No dependencies found for {ds_name}")
    except Exception as e:
        print(f"  ⚠️ Dependencies unavailable for {ds_name}: {e}")
        dataset_info.dependencies_df = pd.DataFrame()

    # Get tables
    try:
        tables = fabric.list_tables(dataset=ds_id, workspace=ws_id)
        if not tables.empty:
            tables = sanitize_df_columns(
                df=tables,
                extra_columns=True,
                ws_id=ws_id,
                ds_id=ds_id,
                ws_name=ws_name,
                ds_name=ds_name
            )
            dataset_info.tables_df = tables
            print(f"  Found {len(tables)} tables")
    except Exception as e:
        print(f"  ⚠️ Tables unavailable for {ds_name}: {e}")
        
    # Get relationships
    try:
        relationships = fabric.list_relationships(dataset=ds_id, workspace=ws_id, extended=True)
        if not relationships.empty:
            relationships = sanitize_df_columns(df=relationships)
            relationships['qualified_from'] = "'" + relationships['from_table'] + "'[" + relationships['from_column'] + "]"
            relationships['qualified_to'] = "'" + relationships['to_table'] + "'[" + relationships['to_column'] + "]"
            dataset_info.relationships_df = relationships
            print(f"  Found {len(relationships)} relationships")
    except Exception as e:
        print(f"  ⚠️ Relationships unavailable for {ds_name}: {e}")

    # Get measures
    try:
        measures = fabric.list_measures(dataset=ds_id, workspace=ws_id)
        if not measures.empty:
            measures = sanitize_df_columns(df=measures)
            dataset_info.measures_df = measures
            print(f"  Found {len(measures)} measures")
    except Exception as e:
        print(f"  ⚠️ Measures unavailable for {ds_name}: {e}")

    # Get columns
    try:
        columns = fabric.list_columns(dataset=ds_id, workspace=ws_id, extended=True)
        if not columns.empty:
            columns = sanitize_df_columns(
                df=columns,
                extra_columns=True,
                ws_id=ws_id,
                ds_id=ds_id,
                ws_name=ws_name,
                ds_name=ds_name
            )
            columns['qualified_name'] = "'" + columns['table_name'] + "'[" + columns['column_name'] + ']'
            dataset_info.columns_df = columns
            print(f"  Found {len(columns)} columns")
    except Exception as e:
        print(f"  ⚠️ Columns unavailable for {ds_name}: {e}")
    
    return dataset_info

def analyze_table_usage(dataset_info: DatasetInfo) -> List[Dict]:
    """
    🆕 Analyze table usage for a single dataset using pre-collected data
    """
    table_usage = []
    
    if dataset_info.tables_df is None or dataset_info.tables_df.empty:
        return table_usage

    # Determine used tables from all sources
    used_tables = set()
    
    if (dataset_info.dependencies_df is not None and 
        not dataset_info.dependencies_df.empty and 
        'referenced_table' in dataset_info.dependencies_df.columns):
        used_tables.update(set(dataset_info.dependencies_df['referenced_table'].dropna()))
    
    if dataset_info.relationships_df is not None:
        used_tables.update(set(dataset_info.relationships_df['from_table'].dropna()))
        used_tables.update(set(dataset_info.relationships_df['to_table'].dropna()))
    
    if dataset_info.measures_df is not None:
        used_tables.update(set(dataset_info.measures_df['table_name'].dropna()))
    
    used_tables = {t for t in used_tables if pd.notna(t)}
    
    # Analyze each table
    for table_name in set(dataset_info.tables_df['name'].dropna()):
        measures_count = 0
        if dataset_info.measures_df is not None:
            measures_count = len(dataset_info.measures_df[dataset_info.measures_df['table_name'] == table_name])
        
        rel_count = 0
        if dataset_info.relationships_df is not None:
            rel_count = len(dataset_info.relationships_df[
                (dataset_info.relationships_df['from_table'] == table_name) | 
                (dataset_info.relationships_df['to_table'] == table_name)
            ])
        
        dep_count = 0
        if (dataset_info.dependencies_df is not None and 
            not dataset_info.dependencies_df.empty and 
            'referenced_table' in dataset_info.dependencies_df.columns):
            dep_count = len(dataset_info.dependencies_df[dataset_info.dependencies_df['referenced_table'] == table_name])
        
        status = "Unused" if table_name not in used_tables else "Used"
        
        table_usage.append({
            'workspace': dataset_info.ws_name,
            'dataset': dataset_info.ds_name,
            'table': table_name,
            'measures': measures_count,
            'relationships': rel_count,
            'dependencies': dep_count,
            'usage': status,
            'workspace_id': dataset_info.ws_id,
            'dataset_id': dataset_info.ds_id
        })
    
    return table_usage

def analyze_column_usage(dataset_info: DatasetInfo) -> List[Dict]:
    """
    🆕 Analyze column usage for a single dataset using pre-collected data
    """
    columns_usage = []
    
    if dataset_info.columns_df is None or dataset_info.columns_df.empty:
        return columns_usage
    
    # Prepare dependency analysis
    dep_columns_df = pd.DataFrame()
    if (dataset_info.dependencies_df is not None and 
        not dataset_info.dependencies_df.empty and 
        'referenced_object_type' in dataset_info.dependencies_df.columns):
        dep_columns_df = dataset_info.dependencies_df[
            dataset_info.dependencies_df['referenced_object_type'].isin(['Column', 'Calc Column'])
        ]
    
    # Extract subsets by object type
    measures_refs_df = pd.DataFrame()
    relationship_refs_df = pd.DataFrame()
    
    if not dep_columns_df.empty and 'object_type' in dep_columns_df.columns:
        measures_refs_df = dep_columns_df[dep_columns_df['object_type'] == 'Measure']
        relationship_refs_df = dep_columns_df[
            dep_columns_df['object_type'].str.contains('Relationship', case=False, na=False)
        ]
    
    # Determine used columns
    dep_columns = set()
    if not dep_columns_df.empty and 'referenced_full_object_name' in dep_columns_df.columns:
        dep_columns = set(dep_columns_df['referenced_full_object_name'])
    rel_columns = set()
    
    if dataset_info.relationships_df is not None:
        rel_columns = set(dataset_info.relationships_df['qualified_from']).union(
            set(dataset_info.relationships_df['qualified_to'])
        )
    
    used_columns = dep_columns.union(rel_columns)
    used_columns = {c for c in used_columns if pd.notna(c)}
    
    # Analyze each column
    for _, row in dataset_info.columns_df.iterrows():
        table_name = row['table_name']
        column_name = row['column_name']
        qualified_name = row['qualified_name']
        
        if pd.isna(column_name):
            continue
        
        dep_count = 0
        if not dep_columns_df.empty and 'referenced_full_object_name' in dep_columns_df.columns:
            dep_count = len(dep_columns_df[dep_columns_df['referenced_full_object_name'] == qualified_name])
        
        # Safe column access with proper empty DataFrame handling
        measure_c = 0
        if not measures_refs_df.empty and 'referenced_full_object_name' in measures_refs_df.columns:
            measure_c = len(measures_refs_df[measures_refs_df['referenced_full_object_name'] == qualified_name])
        
        relationship_c = 0
        if not relationship_refs_df.empty and 'referenced_full_object_name' in relationship_refs_df.columns:
            relationship_c = len(relationship_refs_df[relationship_refs_df['referenced_full_object_name'] == qualified_name])
        
        # Build referenced-by list
        referenced_by = ""
        if not dep_columns_df.empty and all(col in dep_columns_df.columns for col in ['referenced_full_object_name', 'object_name']):
            referenced_by = ", ".join(
                dep_columns_df.loc[
                    dep_columns_df['referenced_full_object_name'] == qualified_name, 'object_name'
                ].unique().tolist()
            )
        
        usage_status = 'Used' if any([measure_c, relationship_c, dep_count]) else 'Unused'
        
        columns_usage.append({
            'workspace': dataset_info.ws_name,
            'dataset': dataset_info.ds_name,
            'table': table_name,
            'column': column_name,
            'measures': measure_c,
            'relationships': relationship_c,
            'dependencies': dep_count,
            'referenced_by': referenced_by,
            'usage': usage_status,
            'workspace_id': dataset_info.ws_id,
            'dataset_id': dataset_info.ds_id
        })
    
    return columns_usage

print("✅ Utility functions and data structures defined")

✅ Utility functions and data structures defined
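
To make the renaming rule in `sanitize_df_columns` concrete, here is a small standalone sketch that applies the same regular expression to plain strings. The helper names `sanitize_names` and `find_collisions` are hypothetical and not part of the notebook; the second helper shows one caveat of this rule, namely that two different source names can map to the same sanitized name.

```python
import re
from typing import List


def sanitize_names(names: List[str]) -> List[str]:
    """Trim and lower-case each name, then turn runs of non-word characters into '_'."""
    return [re.sub(r"\W+", "_", n.strip().lower()) for n in names]


def find_collisions(names: List[str]) -> List[str]:
    """Return sanitized names that are produced by more than one original name."""
    seen, dupes = set(), []
    for s in sanitize_names(names):
        if s in seen and s not in dupes:
            dupes.append(s)
        seen.add(s)
    return dupes


cleaned = sanitize_names(["Dataset Name", " Workspace Id ", "Is Hidden?", "Data-Type"])
collisions = find_collisions(["Row Count", "Row-Count", "Name"])
print(cleaned)     # trailing punctuation becomes a trailing underscore
print(collisions)  # names that would clash after sanitizing
```

If a source DataFrame ever contained such clashing names, Spark would see duplicate columns, so a collision check before saving may be worth considering.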


In [8]:
# ------------------------------------------------------------
# STEP 1: Object Discovery
# ------------------------------------------------------------

print("🔍 Discovering workspaces...")

workspaces_df = fabric.list_workspaces()
workspaces_df = sanitize_df_columns(workspaces_df)
workspaces_df = workspaces_df[['id', 'name', 'type']]
display(workspaces_df)

datasets_all, reports_all, dataflows_all = [], [], []

for _, ws in workspaces_df.iterrows():
    ws_id = ws['id']
    ws_name = ws['name']
    ws_type = ws['type']
    if ws_type == "AdminInsights":
        continue
    print(f"\n📦 Scanning workspace: {ws_name}")

    # --- Datasets
    try:
        ds = fabric.list_datasets(workspace=ws_id)
        if not ds.empty:
            ds['workspace_id'] = ws_id
            ds['workspace_name'] = ws_name
            datasets_all.append(ds)
    except Exception as e:
        print(f"  ⚠️ Datasets error in {ws_name}: {e}")

    # --- Reports (includes both Power BI and Paginated)
    try:
        rep = fabric.list_reports(workspace=ws_id)
        if not rep.empty:
            rep['workspace_id'] = ws_id
            rep['workspace_name'] = ws_name
            reports_all.append(rep)
    except Exception as e:
        print(f"  ⚠️ Reports error in {ws_name}: {e}")

    # --- Dataflows
    try:
        dfs = fabric.list_items(type='Dataflow', workspace=ws_id)
        if not dfs.empty:
            # The list_items output already carries the workspace ID used for the counts below
            dataflows_all.append(dfs)
    except Exception as e:
        print(f"  ⚠️ Dataflows error in {ws_name}: {e}")

# Combine results
datasets_df  = sanitize_df_columns(pd.concat(datasets_all, ignore_index=True) if datasets_all else pd.DataFrame())
reports_df   = sanitize_df_columns(pd.concat(reports_all, ignore_index=True) if reports_all else pd.DataFrame())
dataflows_df = sanitize_df_columns(pd.concat(dataflows_all, ignore_index=True) if dataflows_all else pd.DataFrame())

# Split report types for clarity
if not reports_df.empty and "report_type" in reports_df.columns:
    pbi_reports_df = reports_df[reports_df["report_type"] == "PowerBIReport"].copy()
    paginated_reports_df = reports_df[reports_df["report_type"] == "PaginatedReport"].copy()
else:
    pbi_reports_df = reports_df
    paginated_reports_df = pd.DataFrame()

# 🆕 ADD OBJECT COUNTS TO WORKSPACE DATAFRAME
print("\n📊 Adding object counts to workspace dataframe...")

# Initialize count columns
workspaces_df['dataset_count'] = 0
workspaces_df['total_reports'] = 0
workspaces_df['pbi_reports'] = 0
workspaces_df['paginated_reports'] = 0
workspaces_df['dataflows'] = 0

# Count objects per workspace
if not datasets_df.empty:
    dataset_counts = datasets_df['workspace_id'].value_counts().to_dict()
    workspaces_df['dataset_count'] = workspaces_df['id'].map(dataset_counts).fillna(0).astype(int)

if not reports_df.empty:
    # Total reports count
    total_report_counts = reports_df['workspace_id'].value_counts().to_dict()
    workspaces_df['total_reports'] = workspaces_df['id'].map(total_report_counts).fillna(0).astype(int)
    
    # PBI reports count
    if not pbi_reports_df.empty:
        pbi_counts = pbi_reports_df['workspace_id'].value_counts().to_dict()
        workspaces_df['pbi_reports'] = workspaces_df['id'].map(pbi_counts).fillna(0).astype(int)
    
    # Paginated reports count
    if not paginated_reports_df.empty:
        paginated_counts = paginated_reports_df['workspace_id'].value_counts().to_dict()
        workspaces_df['paginated_reports'] = workspaces_df['id'].map(paginated_counts).fillna(0).astype(int)

if not dataflows_df.empty:
    dataflow_counts = dataflows_df['workspace_id'].value_counts().to_dict()
    workspaces_df['dataflows'] = workspaces_df['id'].map(dataflow_counts).fillna(0).astype(int)

print("\n✅ Object discovery complete with enhanced workspace context.")
print(f"  Workspaces: {len(workspaces_df)}")
print(f"  Datasets:   {len(datasets_df)}")
print(f"  Reports:    {len(reports_df)} (PBI: {len(pbi_reports_df)}, Paginated: {len(paginated_reports_df)})")
print(f"  Dataflows:  {len(dataflows_df)}")

# Display enhanced workspace summary
print("\n📋 Workspace Object Summary:")
workspace_summary = workspaces_df[['name', 'dataset_count', 'total_reports', 'pbi_reports', 'paginated_reports', 'dataflows']]
display(workspace_summary)

# Save to Lakehouse - Enhanced Workspaces
print("\n💾 Saving enhanced workspace data to lakehouse...")
save_to_lakehouse(workspaces_df, "workspace_analysis", "Workspace information with object counts")

🔍 Discovering workspaces...


📦 Scanning workspace: Test Workspace

📦 Scanning workspace: Admin Test Workspace

📦 Scanning workspace: Modelling Workspace Test

📊 Adding object counts to workspace dataframe...

✅ Object discovery complete with enhanced workspace context.
  Workspaces: 4
  Datasets:   8
  Reports:    11 (PBI: 10, Paginated: 1)
  Dataflows:  1

📋 Workspace Object Summary:


💾 Saving enhanced workspace data to lakehouse...
  ✅ Saved 4 records to 'workspace_analysis' table
     📝 Workspace information with object counts
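
The per-workspace counting in Step 1 (count by `workspace_id`, map onto the workspace list, default to 0) can be illustrated without pandas. This is a sketch with made-up workspace IDs and a hypothetical helper name, `count_per_workspace`; it mirrors the intent of the `value_counts()` / `map()` / `fillna(0)` chain rather than reproducing it.

```python
from collections import Counter
from typing import Dict, List


def count_per_workspace(workspace_ids: List[str], items: List[Dict[str, str]]) -> Dict[str, int]:
    """Count items per workspace, reporting 0 for workspaces that own no items."""
    counts = Counter(item["workspace_id"] for item in items)
    return {ws_id: counts.get(ws_id, 0) for ws_id in workspace_ids}


workspace_ids = ["ws-a", "ws-b", "ws-c"]  # hypothetical IDs
datasets = [
    {"name": "Sales", "workspace_id": "ws-a"},
    {"name": "Finance", "workspace_id": "ws-a"},
    {"name": "HR", "workspace_id": "ws-c"},
]
dataset_counts = count_per_workspace(workspace_ids, datasets)
print(dataset_counts)
```

Iterating over the full workspace list, rather than over the counted items, is what guarantees that empty workspaces still appear with a zero.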


In [17]:
# ------------------------------------------------------------
# STEP 2: 🆕 CENTRALIZED DATASET PROCESSING
# ------------------------------------------------------------


# Collection containers for all analysis results
all_dataset_info = []
table_usage_results = []
column_usage_results = []
all_dependencies = []

# Single loop through all datasets - collect everything at once
for _, ds in datasets_df.iterrows():
    ds_id = ds['dataset_id']
    ds_name = ds['dataset_name']
    ws_id = ds['workspace_id']
    ws_name = ds['workspace_name']
    
    # 🆕 Single comprehensive data collection per dataset
    dataset_info = collect_dataset_info(ds_id, ds_name, ws_id, ws_name)
    all_dataset_info.append(dataset_info)
    
    # Collect dependencies for later aggregation
    if dataset_info.dependencies_df is not None and not dataset_info.dependencies_df.empty:
        all_dependencies.append(dataset_info.dependencies_df)
    
    # 🆕 Perform table analysis using collected data
    table_analysis = analyze_table_usage(dataset_info)
    table_usage_results.extend(table_analysis)
    
    # 🆕 Perform column analysis using collected data
    column_analysis = analyze_column_usage(dataset_info)
    column_usage_results.extend(column_analysis)

print(f"\n✅ Centralized processing complete!")
print(f"  📊 Processed {len(all_dataset_info)} datasets")
print(f"  📊 Analyzed {len(table_usage_results)} tables")
print(f"  📊 Analyzed {len(column_usage_results)} columns")
print(f"  📊 Collected {len(all_dependencies)} dependency sets")


🔹 Processing dataset: New Waziri Dashboard Report (Workspace: Test Workspace)
  Found 244 dependencies
  Found 17 tables
  Found 9 relationships
  Found 19 measures
  Found 176 columns
🔹 Processing dataset: Jaffaa AS Report (Workspace: Test Workspace)
  Found 320 dependencies
  Found 11 tables
  Found 6 relationships
  Found 26 measures
  Found 124 columns
🔹 Processing dataset: maven semantic model (Workspace: Test Workspace)
  No dependencies found for maven semantic model
  Found 2 tables
  Found 57 columns
🔹 Processing dataset: Energy Consumption Dashboard (Workspace: Test Workspace)
  Found 550 dependencies
  Found 8 tables
  Found 8 relationships
  Found 26 measures
  Found 107 columns
🔹 Processing dataset: Fabric Analysis SM (Workspace: Test Workspace)
  Found 16 dependencies
  Found 8 tables
  Found 8 relationships
  Found 94 columns
🔹 Processing dataset: U14 GRMFC 2024 (Workspace: Admin Test Workspace)
  Found 3960 dependencies
  Found 20 tables
  Found 14 relationships
  Found
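
The table classification performed for each dataset in this step boils down to a set union: a table counts as used if it appears in any dependency, relationship, or measure. A minimal plain-Python sketch of that rule, with a hypothetical function name and invented table names:

```python
from typing import Dict, Iterable, Set, Tuple


def classify_tables(
    tables: Iterable[str],
    dependency_tables: Iterable[str],
    relationship_pairs: Iterable[Tuple[str, str]],
    measure_tables: Iterable[str],
) -> Dict[str, str]:
    """Mark a table 'Used' if any dependency, relationship, or measure mentions it."""
    used: Set[str] = set(dependency_tables) | set(measure_tables)
    for from_table, to_table in relationship_pairs:
        used.update((from_table, to_table))
    return {t: ("Used" if t in used else "Unused") for t in tables}


status = classify_tables(
    tables=["Sales", "Date", "Customer", "Staging"],
    dependency_tables=["Sales"],
    relationship_pairs=[("Sales", "Date")],
    measure_tables=["Sales"],
)
print(status)
```

Note that "Unused" here only means no reference was found inside the model metadata; report visuals that read a table directly are not part of this check.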

In [18]:
# ------------------------------------------------------------
# STEP 3: Usage Analysis and Enhanced Dataset Context
# ------------------------------------------------------------

print("\n🔎 Analyzing dataset usage and creating enhanced context...")

# 1️⃣ Dataset IDs used by any report (Power BI or Paginated)
used_dataset_ids = set()
if not reports_df.empty:
    used_dataset_ids.update(reports_df['dataset_id'].dropna().unique())

# 2️⃣ Dataset IDs used by dataflows (as sources)
dataflow_refs = []

for _, row in dataflows_df.iterrows():
    try:
        refs = sempy_labs.get_dataflow_references(row['id'], row['workspace_id'])
        if refs is not None and not refs.empty:
            refs['dataflow_id'] = row['id']
            refs['dataflow_name'] = row['name']
            refs['workspace_id'] = row['workspace_id']
            dataflow_refs.append(refs)
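    except Exception as e:
        # Report the failure instead of silently hiding it
        print(f"  ⚠️ Dataflow references unavailable for dataflow {row.get('id')}: {e}")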

dataflow_refs_df = pd.concat(dataflow_refs, ignore_index=True) if dataflow_refs else pd.DataFrame()

if not dataflow_refs_df.empty:
    if 'source_dataset_id' in dataflow_refs_df.columns:
        used_dataset_ids.update(dataflow_refs_df['source_dataset_id'].dropna().unique())

# 3️⃣ Determine unused datasets
unused_datasets_df = datasets_df[~datasets_df['dataset_id'].isin(used_dataset_ids)].copy()

print(f"✅ Found {len(unused_datasets_df)} potentially unused datasets.")

# Enhanced Dataset Analysis with Context
print("\n📊 Creating enhanced dataset analysis with context...")

# Add context columns for each dataset using pre-collected data
enhanced_datasets = datasets_df.copy()
if not enhanced_datasets.empty:
    enhanced_datasets['report_count'] = 0
    enhanced_datasets['dataflow_count'] = 0
    enhanced_datasets['table_count'] = 0
    enhanced_datasets['relationship_count'] = 0
    enhanced_datasets['is_used'] = enhanced_datasets['dataset_id'].isin(used_dataset_ids)
    
    # Count reports per dataset
    if not reports_df.empty:
        report_counts = reports_df.groupby('dataset_id').size().to_dict()
        enhanced_datasets['report_count'] = enhanced_datasets['dataset_id'].map(report_counts).fillna(0).astype(int)
    
    # Count dataflow references per dataset
    if not dataflow_refs_df.empty and 'source_dataset_id' in dataflow_refs_df.columns:
        dataflow_counts = dataflow_refs_df.groupby('source_dataset_id').size().to_dict()
        enhanced_datasets['dataflow_count'] = enhanced_datasets['dataset_id'].map(dataflow_counts).fillna(0).astype(int)
    
    # Add table and relationship counts using pre-collected data
    for dataset_info in all_dataset_info:
        mask = enhanced_datasets['dataset_id'] == dataset_info.ds_id
        
        if dataset_info.tables_df is not None:
            enhanced_datasets.loc[mask, 'table_count'] = len(dataset_info.tables_df)
        
        if dataset_info.relationships_df is not None:
            enhanced_datasets.loc[mask, 'relationship_count'] = len(dataset_info.relationships_df)

# Save Enhanced Dataset Analysis to Lakehouse
print("\n💾 Saving enhanced dataset analysis to lakehouse...")
save_to_lakehouse(enhanced_datasets, "dataset_analysis", 
                 "Datasets with Reports, Tables, Relationships, Dataflows context")


🔎 Analyzing dataset usage and creating enhanced context...
✅ Found 0 potentially unused datasets.

📊 Creating enhanced dataset analysis with context...

💾 Saving enhanced dataset analysis to lakehouse...
  ✅ Saved 8 records to 'dataset_analysis' table
     📝 Datasets with Reports, Tables, Relationships, Dataflows context
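
The "potentially unused" check in this step is a set difference: all dataset IDs minus those referenced by a report or a dataflow. The sketch below shows that logic on invented data; `find_unreferenced` is a hypothetical name. It also skips reports whose dataset ID is missing, on the assumption that some report entries carry no dataset reference.

```python
from typing import Dict, Iterable, List, Optional


def find_unreferenced(
    datasets: List[Dict[str, str]],
    reports: List[Dict[str, Optional[str]]],
    dataflow_source_ids: Iterable[str] = (),
) -> List[str]:
    """Return names of datasets that no report or dataflow refers to."""
    # Ignore reports without a dataset ID so None never counts as a reference.
    referenced = {r["dataset_id"] for r in reports if r.get("dataset_id")}
    referenced.update(dataflow_source_ids)
    return [d["name"] for d in datasets if d["dataset_id"] not in referenced]


datasets = [
    {"dataset_id": "d1", "name": "Sales"},
    {"dataset_id": "d2", "name": "Finance"},
    {"dataset_id": "d3", "name": "Archive"},
]
reports = [
    {"name": "Sales Overview", "dataset_id": "d1"},
    {"name": "Invoice Layout", "dataset_id": None},
]
unreferenced = find_unreferenced(datasets, reports, dataflow_source_ids=["d2"])
print(unreferenced)
```

The result is only as complete as the scan: a dataset referenced from a workspace the caller cannot see would still be listed, which is why the notebook labels these datasets "potentially" unused.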


In [19]:
# ------------------------------------------------------------
# STEP 4: Usage Summary Table Creation
# ------------------------------------------------------------

print("\n📋 Creating usage summary table...")

summary_records = []

for _, ds in datasets_df.iterrows():
    ds_id = ds['dataset_id']
    ds_name = ds['dataset_name']
    ws_name = ds['workspace_name']

    # Reports using this dataset (Power BI and paginated)
    rep_refs = reports_df[reports_df['dataset_id'] == ds_id] if not reports_df.empty else pd.DataFrame()

    # Dataflows referencing this dataset (if any)
    dataflow_refs = []
    if not dataflow_refs_df.empty and 'source_dataset_id' in dataflow_refs_df.columns:
        dataflow_refs = dataflow_refs_df[dataflow_refs_df['source_dataset_id'] == ds_id]

    # Determine usage
    total_refs = len(rep_refs) + len(dataflow_refs)
    usage_status = "Unused" if total_refs == 0 else "Used"

    # Add records for all associated reports
    if not rep_refs.empty:
        for _, r in rep_refs.iterrows():
            summary_records.append({
                "Dataset_Workspace": ws_name,
                "Dataset_Name": ds_name,
                "Report_Name": r['name'],
                "Report_Type": r.get('report_type', 'PowerBIReport'),
                "Report_Workspace": r['workspace_name'],
                "Usage_Status": usage_status,
                "Total_References": total_refs
            })
    # Add a single record for datasets with no report references
    # (covers unused datasets and datasets referenced only by dataflows)
    else:
        summary_records.append({
            "Dataset_Workspace": ws_name,
            "Dataset_Name": ds_name,
            "Report_Name": None,
            "Report_Type": None,
            "Report_Workspace": None,
            "Usage_Status": usage_status,
            "Total_References": total_refs
        })

usage_summary_df = pd.DataFrame(summary_records)
display(usage_summary_df)

# Save Usage Summary to Lakehouse
print("\n💾 Saving usage summary to lakehouse...")
save_to_lakehouse(usage_summary_df, "usage_summary", "Summary of dataset usage patterns")


📋 Creating usage summary table...


💾 Saving usage summary to lakehouse...
  ✅ Saved 9 records to 'usage_summary' table
     📝 Summary of dataset usage patterns
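
The row layout of `usage_summary` flattens a one-to-many relationship: one row per report that uses a dataset, plus a placeholder row when a dataset has no reports. A simplified sketch of that shape, using a hypothetical helper `build_summary_rows` and a reduced set of columns:

```python
from typing import Dict, List, Optional


def build_summary_rows(dataset: str, report_names: List[str], dataflow_refs: int = 0) -> List[Dict[str, object]]:
    """Emit one row per report, or one placeholder row if there are no reports."""
    total = len(report_names) + dataflow_refs
    status = "Used" if total else "Unused"
    names: List[Optional[str]] = list(report_names) or [None]
    return [
        {
            "Dataset_Name": dataset,
            "Report_Name": name,
            "Usage_Status": status,
            "Total_References": total,
        }
        for name in names
    ]


rows = build_summary_rows("Sales", ["Overview", "Regional"]) + build_summary_rows("Archive", [])
for row in rows:
    print(row)
```

Keeping a placeholder row means every dataset appears at least once in the summary, so an anti-join is not needed to find datasets without reports.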


In [20]:
# ------------------------------------------------------------
# STEP 5: 🆕 RESULTS PROCESSING & LAKEHOUSE SAVING
# Process and save the pre-collected analysis results
# ------------------------------------------------------------

print("\n💾 Processing and saving analysis results to lakehouse...")

# Convert table analysis results to DataFrame
if table_usage_results:
    table_usage_df = pd.DataFrame(table_usage_results)
    display(table_usage_df)
    
    print("\n💾 Saving table analysis to lakehouse...")
    save_to_lakehouse(table_usage_df, "table_analysis", 
                     "Tables with usage context from measures, relationships, and dependencies")
else:
    print("⚠️ No table usage data to save")

# Convert column analysis results to DataFrame
if column_usage_results:
    columns_usage_df = pd.DataFrame(column_usage_results)
    display(columns_usage_df)
    
    print("\n💾 Saving column usage analysis to lakehouse...")
    save_to_lakehouse(columns_usage_df, "column_usage_analysis", 
                     "Detailed column usage analysis with context from measures, relationships, and dependencies")
else:
    print("⚠️ No column usage data to save")

print("\n✅ All analysis results saved to lakehouse!")


💾 Processing and saving analysis results to lakehouse...


💾 Saving table analysis to lakehouse...
  ✅ Saved 71 records to 'table_analysis' table
     📝 Tables with usage context from measures, relationships, and dependencies


💾 Saving column usage analysis to lakehouse...
  ✅ Saved 867 records to 'column_usage_analysis' table
     📝 Detailed column usage analysis with context from measures, relationships, and dependencies

✅ All analysis results saved to lakehouse!
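
The column rows saved above rely on matching fully qualified names of the form `'Table'[Column]` against the names referenced by dependencies. The following sketch shows that matching on invented data; `qualified_name` and `column_reference_counts` are hypothetical helpers, and the reference list stands in for the dependency metadata.

```python
from collections import Counter
from typing import Dict, List, Tuple


def qualified_name(table: str, column: str) -> str:
    """Build the 'Table'[Column] form used to match columns against references."""
    return f"'{table}'[{column}]"


def column_reference_counts(
    columns: List[Tuple[str, str]], references: List[str]
) -> Dict[str, int]:
    """Count how often each column's qualified name occurs in the reference list."""
    counts = Counter(references)
    return {qualified_name(t, c): counts.get(qualified_name(t, c), 0) for t, c in columns}


columns = [("Sales", "Amount"), ("Sales", "Note"), ("Date", "Year")]
references = ["'Sales'[Amount]", "'Date'[Year]", "'Sales'[Amount]"]
ref_counts = column_reference_counts(columns, references)
unused = [name for name, n in ref_counts.items() if n == 0]
print(ref_counts)
print(unused)
```

The match is an exact string comparison, so both sides must quote table names the same way for a reference to be counted.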


In [22]:
# ------------------------------------------------------------
# STEP 6: Final Summary and Performance Metrics
# ------------------------------------------------------------

print("\n" + "="*80)
print("🎉 FABRIC WORKSPACE ANALYSIS COMPLETE")
print("="*80)

# Summary statistics
print(f"📊 Discovery Summary:")
print(f"  Workspaces: {len(workspaces_df)}")
print(f"  Datasets:   {len(datasets_df)}")
print(f"  Reports:    {len(reports_df)}")
print(f"  Dataflows:  {len(dataflows_df)}")

if table_usage_results:
    used_tables = sum(1 for t in table_usage_results if t['usage'] == 'Used')
    unused_tables = sum(1 for t in table_usage_results if t['usage'] == 'Unused')
    print(f"  Tables:     {len(table_usage_results)} (Used: {used_tables}, Unused: {unused_tables})")

if column_usage_results:
    used_columns = sum(1 for c in column_usage_results if c['usage'] == 'Used')
    unused_columns = sum(1 for c in column_usage_results if c['usage'] == 'Unused')
    print(f"  Columns:    {len(column_usage_results)} (Used: {used_columns}, Unused: {unused_columns})")

print(f"\n💾 Lakehouse Tables Created:")
print(f"  📊 workspace_analysis - Basic workspace information")
print(f"  📊 dataset_analysis - Datasets with context (Reports, Tables, Relationships, Dataflows)")
print(f"  📊 table_analysis - Tables with usage context from measures, relationships, dependencies")
print(f"  📊 column_usage_analysis - Detailed column usage analysis")
print(f"  📊 usage_summary - Summary of dataset usage patterns")

# Display final unused datasets
if not unused_datasets_df.empty:
    print("\n⚠️ UNUSED DATASETS")
    for _, row in unused_datasets_df.iterrows():
        print(f" - {row['workspace_name']} → {row['dataset_name']}")
else:
    print("\n🎉 No unused datasets found!")

print("\n" + "="*80)
print("✅ Check your lakehouse for detailed results.")
print("="*80)


🎉 FABRIC WORKSPACE ANALYSIS COMPLETE
📊 Discovery Summary:
  Workspaces: 4
  Datasets:   8
  Reports:    11
  Dataflows:  1
  Tables:     71 (Used: 63, Unused: 8)
  Columns:    867 (Used: 163, Unused: 704)

💾 Lakehouse Tables Created:
  📊 workspace_analysis - Basic workspace information
  📊 dataset_analysis - Datasets with context (Reports, Tables, Relationships, Dataflows)
  📊 table_analysis - Tables with usage context from measures, relationships, dependencies
  📊 column_usage_analysis - Detailed column usage analysis
  📊 usage_summary - Summary of dataset usage patterns

🎉 No unused datasets found!

✅ Check your lakehouse for detailed results.
