## Understanding Databricks Tagging Options

Databricks provides three distinct tagging systems, each serving different purposes within the platform:

1. **Resource-level tagging**: For attributing compute costs to teams, projects, or users
2. **Unity Catalog securable object tagging**: For organizing, classifying, and governing data assets
3. **Serverless compute workload tagging**: For tracking usage of serverless resources

These tagging systems enable better governance, cost management, and organization of both compute resources and data assets within Databricks.

## Resource-Level Tagging

Resource-level tags allow you to attribute compute usage to specific teams, projects, or cost centers with greater granularity than default tags. These tags propagate to both your account's usage logs and applicable cloud resources.

### Types of Resource Tags

There are two types of resource tags in Databricks:

1. **Default tags**: Automatically applied by Databricks to compute resources, providing basic metadata like vendor, cluster ID, and creator
2. **Custom tags**: User-defined tags that you add to resources for more granular tracking

### Supported Resources for Custom Tags

You can apply custom tags to the following resources:


| Object | Tagging interface | Python approach |
| :-- | :-- | :-- |
| Workspace | Cloud provider portal | Cloud provider API |
| Pool | Pools UI or Instance Pool API | Databricks API |
| Clusters (all-purpose and job) | Compute UI or Clusters API | Databricks API |
| SQL warehouse | SQL warehouse UI or Warehouses API | Databricks API |


In [0]:
%pip install python-dotenv

In [0]:
from dotenv import load_dotenv
import os
    
load_dotenv()

TOKEN = os.getenv("TOKEN")
DATABRICKS_INSTANCE = os.getenv("DATABRICKS_INSTANCE")
CLUSTER_ID = os.getenv("CLUSTER_ID")
WAREHOUSE_ID = os.getenv("WAREHOUSE_ID")

print(f"TOKEN: {'set' if TOKEN else 'NOT SET'}")  # Avoid printing the secret itself
print(f"DATABRICKS_INSTANCE: {DATABRICKS_INSTANCE}")
print(f"CLUSTER_ID: {CLUSTER_ID}")
print(f"WAREHOUSE_ID: {WAREHOUSE_ID}")
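
If any of these variables is missing, `os.getenv` returns `None` and the later API calls fail with less obvious errors. A minimal sketch of an early check follows; `require_settings` and `mask_secret` are hypothetical helpers (not part of python-dotenv), and the example values are made up.

```python
import os


def mask_secret(value, visible=4):
    """Return a masked form of a secret, keeping only the last few characters."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def require_settings(names, env=None):
    """Return a dict of the requested settings, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing required settings: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Example with an explicit mapping instead of the real environment (made-up values)
settings = require_settings(
    ["TOKEN", "DATABRICKS_INSTANCE"],
    env={"TOKEN": "dapi0123456789abcd", "DATABRICKS_INSTANCE": "example.cloud.databricks.com"},
)
print(mask_secret(settings["TOKEN"]))
```

Calling `require_settings` right after `load_dotenv()` surfaces a missing variable before any request is sent.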

### Implementation with Python

#### Tagging Clusters

In [0]:
import requests
import json

# Parameters - CLUSTER_ID, DATABRICKS_INSTANCE, and TOKEN are loaded from the
# environment in the setup cell above; only the tags are set here
TAGS_TO_ADD = {
    "role": "data_science",
    "req": "4356",
    "project": "dipt",
    "env": "prod"
}

def tag_cluster(cluster_id, custom_tags, databricks_instance, token):
    """
    Add or update tags on an existing Databricks cluster.
    
    Parameters:
    - cluster_id: ID of the existing cluster
    - custom_tags: Dictionary of tag key-value pairs
    - databricks_instance: Your Databricks workspace URL
    - token: Your Databricks personal access token
    """
    # First, get current cluster configuration
    api_endpoint = f"https://{databricks_instance}/api/2.0/clusters/get"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }
    
    response = requests.get(
        api_endpoint,
        headers=headers,
        params={"cluster_id": cluster_id}
    )
    
    if response.status_code != 200:
        print(f"Error getting cluster: {response.text}")
        return False
    
    # Get current configuration
    cluster_config = response.json()
    
    # Update the tags
    current_tags = cluster_config.get("custom_tags", {})
    current_tags.update(custom_tags)
    cluster_config["custom_tags"] = current_tags
    
    # Remove fields that cannot be included in edit request
    for field in ["creator_user_name", "start_time", "state", 
                 "state_message", "default_tags", "cluster_source"]:
        if field in cluster_config:
            del cluster_config[field]

    # Ensure cluster_id is included in the configuration
    cluster_config["cluster_id"] = cluster_id
    
    # Edit cluster with updated tags (note: editing a running cluster restarts it)
    edit_endpoint = f"https://{databricks_instance}/api/2.0/clusters/edit"
    
    response = requests.post(
        edit_endpoint,
        headers=headers,
        data=json.dumps(cluster_config)
    )
    
    if response.status_code == 200:
        print(f"Successfully updated tags on cluster {cluster_id}")
        return True
    else:
        print(f"Error updating tags: {response.text}")
        return False

# Call the function with the parameters defined at the top
tag_cluster(
    cluster_id=CLUSTER_ID,
    custom_tags=TAGS_TO_ADD,
    databricks_instance=DATABRICKS_INSTANCE,
    token=TOKEN
)
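
The merge-and-strip step inside `tag_cluster` can be checked without calling the API. The sketch below isolates it as a pure function; `build_cluster_edit_payload` is a hypothetical helper, and the list of read-only fields simply mirrors the one used above, which may not be exhaustive for every workspace.

```python
import copy

# Fields returned by clusters/get that tag_cluster removes before editing.
# Assumption: this mirrors the list above and is not guaranteed to be complete.
READ_ONLY_FIELDS = (
    "creator_user_name", "start_time", "state",
    "state_message", "default_tags", "cluster_source",
)


def build_cluster_edit_payload(cluster_config, new_tags, read_only_fields=READ_ONLY_FIELDS):
    """Return a new edit payload with merged custom tags; the input is not modified."""
    payload = copy.deepcopy(cluster_config)
    merged = dict(payload.get("custom_tags") or {})
    merged.update(new_tags)  # New values win on key collisions
    payload["custom_tags"] = merged
    for field in read_only_fields:
        payload.pop(field, None)
    return payload


# Example with a hand-written configuration (made-up cluster ID)
current = {
    "cluster_id": "0123-456789-abcdefgh",
    "state": "RUNNING",
    "custom_tags": {"env": "dev", "team": "analytics"},
}
payload = build_cluster_edit_payload(current, {"env": "prod", "project": "dipt"})
print(payload["custom_tags"])
```

Keeping the payload construction separate from the HTTP calls makes it possible to unit-test the tag merge on its own.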

#### Creating a New Cluster with Tags

In [0]:
import requests
import json

# Parameters - set these variables at the top of your notebook cell
CLUSTER_NAME = "My Databricks Cluster"
# DATABRICKS_INSTANCE and TOKEN are loaded from the environment in the setup cell above
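SPARK_VERSION = "11.3.x-scala2.12"  # Example only; use a runtime version available in your workspace
NODE_TYPE = "Standard_DS3_v2"  # Azure VM type; use a node type valid for your cloud provider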
MIN_WORKERS = 1
MAX_WORKERS = 2
TAGS_TO_ADD = {
    "role": "data_science",
    "req": "4356",
    "project": "dipt",
    "env": "prod"
}

def create_cluster_with_tags(cluster_name, custom_tags, databricks_instance, token, 
                            spark_version="11.3.x-scala2.12", node_type="Standard_DS3_v2", 
                            min_workers=1, max_workers=2):
    """
    Create a new Databricks cluster with custom tags.
    
    Parameters:
    - cluster_name: Name for the new cluster
    - custom_tags: Dictionary of tag key-value pairs
    - databricks_instance: Your Databricks workspace URL
    - token: Your Databricks personal access token
    - spark_version: Databricks Runtime version
    - node_type: VM type for the cluster nodes
    - min_workers, max_workers: Worker node count range
    """
    api_endpoint = f"https://{databricks_instance}/api/2.0/clusters/create"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }
    
    # IMPORTANT: Don't use 'Name' as a custom tag key (reserved by Databricks)
    # Work on a copy so the caller's dictionary is not modified
    custom_tags = dict(custom_tags)
    if "Name" in custom_tags:
        print("Warning: 'Name' is a reserved tag and has been removed")
        del custom_tags["Name"]
    
    cluster_config = {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type,
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers
        },
        "custom_tags": custom_tags
    }
    
    response = requests.post(
        api_endpoint,
        headers=headers,
        data=json.dumps(cluster_config)
    )
    
    if response.status_code == 200:
        result = response.json()
        print(f"Successfully created cluster with ID: {result['cluster_id']}")
        return result['cluster_id']
    else:
        print(f"Error creating cluster: {response.text}")
        return None

# Call the function with the parameters defined at the top
cluster_id = create_cluster_with_tags(
    cluster_name=CLUSTER_NAME,
    custom_tags=TAGS_TO_ADD,
    databricks_instance=DATABRICKS_INSTANCE,
    token=TOKEN,
    spark_version=SPARK_VERSION,
    node_type=NODE_TYPE,
    min_workers=MIN_WORKERS,
    max_workers=MAX_WORKERS
)

#### Tagging SQL Warehouses

In [0]:
import requests
import json

# Parameters - set these variables at the top of your notebook cell
# WAREHOUSE_ID, DATABRICKS_INSTANCE, and TOKEN are loaded from the environment in the setup cell above
TAGS_TO_ADD = {
    "role": "data_science",
    "req": "4356",
    "project": "dipt",
    "env": "prod"
}

def tag_sql_warehouse(warehouse_id, custom_tags, databricks_instance, token):
    """
    Add or update tags on a Databricks SQL warehouse.
    
    Parameters:
    - warehouse_id: ID of the SQL warehouse
    - custom_tags: Dictionary of tag key-value pairs
    - databricks_instance: Your Databricks workspace URL
    - token: Your Databricks personal access token
    """
    # First, get current warehouse configuration
    api_endpoint = f"https://{databricks_instance}/api/2.0/sql/warehouses/{warehouse_id}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }
    
    response = requests.get(
        api_endpoint,
        headers=headers
    )
    
    if response.status_code != 200:
        print(f"Error getting warehouse: {response.text}")
        return False
    
    # Get current configuration
    warehouse_config = response.json()
    
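    # Update the tags. The Warehouses API stores tags as
    # {"custom_tags": [{"key": ..., "value": ...}]}, so merge through a flat dict
    existing_pairs = (warehouse_config.get("tags") or {}).get("custom_tags") or []
    merged_tags = {pair["key"]: pair["value"] for pair in existing_pairs}
    merged_tags.update(custom_tags)
    current_tags = {"custom_tags": [{"key": k, "value": v} for k, v in merged_tags.items()]}
    warehouse_config["tags"] = current_tags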
    
    # Edit warehouse with updated tags
    edit_endpoint = f"https://{databricks_instance}/api/2.0/sql/warehouses/{warehouse_id}/edit"
    
    # Prepare the required fields for the edit request
    edit_payload = {
        "id": warehouse_id,
        "name": warehouse_config["name"],
        "tags": current_tags
    }
    
    response = requests.post(
        edit_endpoint,
        headers=headers,
        data=json.dumps(edit_payload)
    )
    
    if response.status_code == 200:
        print(f"Successfully updated tags on SQL warehouse {warehouse_id}")
        return True
    else:
        print(f"Error updating tags: {response.text}")
        return False

# Call the function with the parameters defined at the top
tag_sql_warehouse(
    warehouse_id=WAREHOUSE_ID,
    custom_tags=TAGS_TO_ADD,
    databricks_instance=DATABRICKS_INSTANCE,
    token=TOKEN
)
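
The Warehouses API represents tags as a list of key/value pairs nested under `custom_tags`, rather than as a flat dictionary. A small sketch of converting between the two shapes follows, assuming that payload layout; both helper names are hypothetical.

```python
def warehouse_tags_to_dict(tags_field):
    """Flatten {"custom_tags": [{"key": k, "value": v}, ...]} into {k: v}."""
    pairs = (tags_field or {}).get("custom_tags") or []
    return {pair["key"]: pair["value"] for pair in pairs}


def dict_to_warehouse_tags(tags_dict):
    """Build the nested list-of-pairs structure from a flat dictionary."""
    return {"custom_tags": [{"key": k, "value": v} for k, v in tags_dict.items()]}


# Merge new tags into an existing warehouse tag structure
existing = {"custom_tags": [{"key": "env", "value": "dev"}]}
merged = warehouse_tags_to_dict(existing)
merged.update({"env": "prod", "project": "dipt"})
print(dict_to_warehouse_tags(merged))
```

Working with a flat dictionary in between keeps the "add or update" semantics simple: a repeated key overwrites the old value instead of producing a duplicate pair.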

## Unity Catalog Securable Object Tagging

Unity Catalog allows you to apply tags to various data assets to improve organization, classification, governance, and discoverability.

### Supported Securable Objects

You can apply tags to the following objects in Unity Catalog:

- Catalogs
- Schemas
- Tables (including views, materialized views, streaming tables)
- Table columns
- Volumes
- Registered models and model versions


### Tag Constraints and Requirements

- Maximum of 50 tags per securable object
- Maximum tag key length is 255 characters
- Maximum tag value length is 1000 characters
- Certain characters (`. , - = / :`) are not allowed in tag keys
- Tag search requires exact term matching (no wildcards)
- To add tags, you must own the object or have the `APPLY TAG` privilege, along with `USE SCHEMA` on the parent schema and `USE CATALOG` on the parent catalog
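
These limits can be checked before issuing any SQL. A minimal sketch using only the constraints listed above follows; `validate_tags` is a hypothetical helper and does not cover the privilege requirements.

```python
MAX_TAGS_PER_OBJECT = 50
MAX_KEY_LENGTH = 255
MAX_VALUE_LENGTH = 1000
FORBIDDEN_KEY_CHARS = set(".,-=/:")


def validate_tags(tags_dict):
    """Return a list of human-readable problems; an empty list means the tags pass."""
    problems = []
    if len(tags_dict) > MAX_TAGS_PER_OBJECT:
        problems.append(f"{len(tags_dict)} tags exceeds the limit of {MAX_TAGS_PER_OBJECT}")
    for key, value in tags_dict.items():
        if len(key) > MAX_KEY_LENGTH:
            problems.append(f"key '{key[:20]}...' is longer than {MAX_KEY_LENGTH} characters")
        bad_chars = sorted(FORBIDDEN_KEY_CHARS.intersection(key))
        if bad_chars:
            problems.append(f"key '{key}' contains disallowed characters: {' '.join(bad_chars)}")
        if len(str(value)) > MAX_VALUE_LENGTH:
            problems.append(f"value for '{key}' is longer than {MAX_VALUE_LENGTH} characters")
    return problems


# A hyphenated key is rejected; an underscore would be accepted
print(validate_tags({"env": "prod", "cost-center": "4356"}))
```

Running the check first means a bad key is reported with a clear message instead of a SQL error from the `SET TAGS` statement.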

#### Tagging Tables

In [0]:
from pyspark.sql import SparkSession

# Parameters - set these variables at the top of your notebook cell
CATALOG_NAME = "tagging_test"
SCHEMA_NAME = "tagging_tables"
TABLE_NAME = "diamonds"
TAGS_TO_APPLY = {
    "role": "data_science",
    "req": "4356",
    "project": "dipt",
    "env": "prod"
}

def tag_table(catalog_name, schema_name, table_name, tags_dict):
    """
    Apply tags to a table in Unity Catalog.
    
    Parameters:
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema
    - table_name: Name of the table
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # Full table reference
    full_table_name = f"{catalog_name}.{schema_name}.{table_name}"
    
    # SQL command to set tags
    sql_command = f"ALTER TABLE {full_table_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to {full_table_name}")

# Call the function with the parameters defined at the top
tag_table(
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
    table_name=TABLE_NAME,
    tags_dict=TAGS_TO_APPLY
)
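
The f-string interpolation in `tag_table` breaks if a tag key or value contains a single quote, and object names with unusual characters need backtick quoting. One possible way to build the statement more defensively is sketched below. The helper names are hypothetical, and the literal escaping assumes Spark SQL's default backslash-escaped string literals.

```python
def quote_identifier(name):
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def quote_literal(value):
    """Render a value as a single-quoted SQL string literal with backslash escaping."""
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def build_set_tags_sql(catalog_name, schema_name, table_name, tags_dict):
    """Build an ALTER TABLE ... SET TAGS statement from a dictionary of tags."""
    if not tags_dict:
        raise ValueError("tags_dict must contain at least one tag")
    parts = (catalog_name, schema_name, table_name)
    full_name = ".".join(quote_identifier(part) for part in parts)
    tags_sql = ", ".join(
        f"{quote_literal(k)} = {quote_literal(v)}" for k, v in tags_dict.items()
    )
    return f"ALTER TABLE {full_name} SET TAGS ({tags_sql})"


# A value containing a single quote is escaped rather than terminating the literal
print(build_set_tags_sql("tagging_test", "tagging_tables", "diamonds", {"owner": "O'Brien"}))
```

The returned string can be passed to `spark.sql(...)` in place of the interpolated command.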

#### Tagging Table Columns

In [0]:
from pyspark.sql import SparkSession

# Parameters - set these variables at the top of your notebook cell
CATALOG_NAME = "tagging_test"
SCHEMA_NAME = "tagging_tables"
TABLE_NAME = "diamonds"
COLUMN_NAME = "clarity"
TAGS_TO_APPLY = {
    "role": "data_science",
    "req": "4356",
    "project": "dipt",
    "env": "prod"
}

def tag_table_column(catalog_name, schema_name, table_name, column_name, tags_dict):
    """
    Apply tags to a specific column in a table.
    
    Parameters:
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema  
    - table_name: Name of the table
    - column_name: Name of the column to tag
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # Full table reference
    full_table_name = f"{catalog_name}.{schema_name}.{table_name}"
    
    # SQL command to set column tags
    sql_command = f"ALTER TABLE {full_table_name} ALTER COLUMN {column_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to column {column_name} in {full_table_name}")

# Call the function with the parameters defined at the top
tag_table_column(
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
    table_name=TABLE_NAME,
    column_name=COLUMN_NAME,
    tags_dict=TAGS_TO_APPLY
)

#### Tagging Schemas

In [0]:
from pyspark.sql import SparkSession

# Parameters - set these variables at the top of your notebook cell
CATALOG_NAME = "tagging_test"
SCHEMA_NAME = "tagging_tables"
TAGS_TO_APPLY = {
    "role": "data_science",
    "req": "4356",
    "project": "dipt",
    "env": "prod"
}

def tag_schema(catalog_name, schema_name, tags_dict):
    """
    Apply tags to a schema in Unity Catalog.
    
    Parameters:
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # Full schema reference
    full_schema_name = f"{catalog_name}.{schema_name}"
    
    # SQL command to set tags
    sql_command = f"ALTER SCHEMA {full_schema_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to schema {full_schema_name}")

# Call the function with the parameters defined at the top
tag_schema(
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
    tags_dict=TAGS_TO_APPLY
)

#### Tagging Catalogs

In [0]:
from pyspark.sql import SparkSession

# Parameters - set these variables at the top of your notebook cell
CATALOG_NAME = "tagging_test"
TAGS_TO_APPLY = {
    "role": "data_science",
    "req": "4356",
    "project": "dipt",
    "env": "prod"
}


def tag_catalog(catalog_name, tags_dict):
    """
    Apply tags to a catalog in Unity Catalog.
    
    Parameters:
    - catalog_name: Name of the catalog
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # SQL command to set tags
    sql_command = f"ALTER CATALOG {catalog_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to catalog {catalog_name}")

# Call the function with the parameters defined at the top
tag_catalog(
    catalog_name=CATALOG_NAME,
    tags_dict=TAGS_TO_APPLY
)

#### Removing Tags

In [0]:
from pyspark.sql import SparkSession

# Parameters - set these variables at the top of your notebook cell
SECURABLE_TYPE = "TABLE"  # Options: 'CATALOG', 'SCHEMA', 'TABLE', or 'COLUMN'
CATALOG_NAME = "tagging_test"
SCHEMA_NAME = "tagging_tables"  # Set to None if not applicable
TABLE_NAME = "diamonds"    # Set to None if not applicable
COLUMN_NAME = "clarity"   # Set to None if not applicable
TAG_KEYS = ["project", "env"]  # Set to None to remove all tags

def remove_tags(securable_type, catalog_name, schema_name=None, table_name=None, 
               column_name=None, tag_keys=None):
    """
    Remove tags from a securable object.
    
    Parameters:
    - securable_type: Type of object ('CATALOG', 'SCHEMA', 'TABLE', or 'COLUMN')
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema (if applicable)
    - table_name: Name of the table (if applicable)
    - column_name: Name of the column (if applicable)
    - tag_keys: List of tag keys to remove (if None, all tags are removed)
    """
    spark = SparkSession.builder.getOrCreate()
    
    if securable_type == 'CATALOG':
        object_name = catalog_name
        sql_prefix = f"ALTER CATALOG {object_name}"
    elif securable_type == 'SCHEMA':
        object_name = f"{catalog_name}.{schema_name}"
        sql_prefix = f"ALTER SCHEMA {object_name}"
    elif securable_type == 'TABLE':
        object_name = f"{catalog_name}.{schema_name}.{table_name}"
        sql_prefix = f"ALTER TABLE {object_name}"
    elif securable_type == 'COLUMN':
        object_name = f"{catalog_name}.{schema_name}.{table_name}.{column_name}"
        sql_prefix = f"ALTER TABLE {catalog_name}.{schema_name}.{table_name} ALTER COLUMN {column_name}"
    else:
        raise ValueError("Invalid securable_type. Must be 'CATALOG', 'SCHEMA', 'TABLE', or 'COLUMN'")
    
    # If tag_keys is provided, only remove those specific tags
    if tag_keys:
        tag_keys_sql = ", ".join([f"'{key}'" for key in tag_keys])
        sql_command = f"{sql_prefix} UNSET TAGS ({tag_keys_sql})"
    else:
        # Get all existing tags for the object and remove them
        if securable_type == 'CATALOG':
            all_tags = spark.sql(f"SELECT tag_name FROM system.information_schema.catalog_tags WHERE catalog_name = '{catalog_name}'")
        elif securable_type == 'SCHEMA':
            all_tags = spark.sql(f"SELECT tag_name FROM system.information_schema.schema_tags WHERE catalog_name = '{catalog_name}' AND schema_name = '{schema_name}'")
        elif securable_type == 'TABLE':
            all_tags = spark.sql(f"SELECT tag_name FROM system.information_schema.table_tags WHERE catalog_name = '{catalog_name}' AND schema_name = '{schema_name}' AND table_name = '{table_name}'")
        elif securable_type == 'COLUMN':
            all_tags = spark.sql(f"SELECT tag_name FROM system.information_schema.column_tags WHERE catalog_name = '{catalog_name}' AND schema_name = '{schema_name}' AND table_name = '{table_name}' AND column_name = '{column_name}'")
        
        tag_keys = [row.tag_name for row in all_tags.collect()]
        if tag_keys:
            tag_keys_sql = ", ".join([f"'{key}'" for key in tag_keys])
            sql_command = f"{sql_prefix} UNSET TAGS ({tag_keys_sql})"
        else:
            print(f"No tags found for {object_name}")
            return
    
    # Execute the command
    spark.sql(sql_command)
    print(f"Successfully removed tags from {object_name}")

# Call the function with the parameters defined at the top
remove_tags(
    securable_type=SECURABLE_TYPE,
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
    table_name=TABLE_NAME,
    column_name=COLUMN_NAME,
    tag_keys=TAG_KEYS
)

## Automated Tagging Strategies

#### Automated Table and Column Tagging Based on Content

This example shows how to automatically scan tables for sensitive data patterns and apply appropriate tags:

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Parameters - set these variables at the top of your notebook cell
CATALOG_NAME = "tagging_test"
SCHEMA_NAME = "tagging_tables"
TABLE_NAME = "diamonds"

# PII patterns configuration - customize as needed
PII_PATTERNS = {
    "EMAIL": r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
    "SSN": r'^\d{3}-\d{2}-\d{4}$',
    "CREDIT_CARD": r'^\d{4}-\d{4}-\d{4}-\d{4}$',
    "PHONE_NUMBER": r'^\d{3}-\d{3}-\d{4}$'
}

# PII name indicators configuration - customize as needed
# Keys shared with PII_PATTERNS use the same names so both checks emit consistent PII_TYPE values
PII_NAME_INDICATORS = {
    "EMAIL": ["email", "e-mail", "mail"],
    "SSN": ["ssn", "social", "security"],
    "CREDIT_CARD": ["cc", "credit", "card", "payment"],
    "PHONE_NUMBER": ["phone", "mobile", "cell", "tel"],
    "ADDRESS": ["address", "addr", "street"],
    "NAME": ["name", "firstname", "lastname", "fullname"]
}

# Function to tag table columns
def tag_table_column(catalog_name, schema_name, table_name, column_name, tags_dict):
    """
    Apply tags to a specific column in a table.
    
    Parameters:
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema  
    - table_name: Name of the table
    - column_name: Name of the column to tag
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # Full table reference
    full_table_name = f"{catalog_name}.{schema_name}.{table_name}"
    
    # SQL command to set column tags
    sql_command = f"ALTER TABLE {full_table_name} ALTER COLUMN {column_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to column {column_name} in {full_table_name}")

# Function to tag tables
def tag_table(catalog_name, schema_name, table_name, tags_dict):
    """
    Apply tags to a table in Unity Catalog.
    
    Parameters:
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema
    - table_name: Name of the table
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # Full table reference
    full_table_name = f"{catalog_name}.{schema_name}.{table_name}"
    
    # SQL command to set tags
    sql_command = f"ALTER TABLE {full_table_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to {full_table_name}")

def scan_and_tag_pii_columns(catalog_name, schema_name, table_name, pii_patterns, pii_name_indicators):
    """
    Scan a table for potential PII columns and apply appropriate tags.
    
    Parameters:
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema
    - table_name: Name of the table
    - pii_patterns: Dictionary of regex patterns for PII detection
    - pii_name_indicators: Dictionary of keywords that suggest PII in column names
    """
    spark = SparkSession.builder.getOrCreate()
    
    # Get table schema
    full_table_name = f"{catalog_name}.{schema_name}.{table_name}"
    df = spark.table(full_table_name)
    
    # For each column, check for PII patterns
    for column in df.columns:
        # Skip if column is not string type. typeName() is used because
        # str(dataType) differs across PySpark versions ("StringType" vs "StringType()")
        if df.schema[column].dataType.typeName() not in ("string", "varchar", "char"):
            continue

        col_data = df.select(col(column)).limit(100)  # Sample first 100 rows
            
        detected_tags = {}
        
        # Check column name for PII indicators
        for pii_type, indicators in pii_name_indicators.items():
            if any(indicator in column.lower() for indicator in indicators):
                detected_tags["PII_TYPE"] = pii_type
                detected_tags["SENSITIVITY"] = "HIGH"
                break
                
        # Check sample data for PII patterns
        if not detected_tags:
            for pii_type, pattern in pii_patterns.items():
                # Convert column to string and check if any values match the pattern
                matches = col_data.filter(col(column).rlike(pattern)).count()
                if matches > 0:
                    detected_tags["PII_TYPE"] = pii_type
                    detected_tags["SENSITIVITY"] = "HIGH"
                    break
        
        # Apply tags if PII was detected
        if detected_tags:
            tag_table_column(catalog_name, schema_name, table_name, column, detected_tags)
            print(f"Tagged column {column} in {full_table_name} as {detected_tags}")
    
    # Apply table-level tags based on column findings
    table_tags = {}
    
    # Query information schema to see what PII was found
    query = f"""
    SELECT tag_name, tag_value 
    FROM system.information_schema.column_tags 
    WHERE catalog_name = '{catalog_name}' 
    AND schema_name = '{schema_name}' 
    AND table_name = '{table_name}'
    AND tag_name = 'PII_TYPE'
    """
    
    pii_columns = spark.sql(query).collect()
    
    if pii_columns:
        table_tags["CONTAINS_PII"] = "TRUE"
        table_tags["GOVERNANCE_LEVEL"] = "RESTRICTED"
        
        # Apply the table-level tags
        tag_table(catalog_name, schema_name, table_name, table_tags)
        print(f"Tagged table {full_table_name} as {table_tags}")

# Call the function with the parameters defined at the top
scan_and_tag_pii_columns(
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
    table_name=TABLE_NAME,
    pii_patterns=PII_PATTERNS,
    pii_name_indicators=PII_NAME_INDICATORS
)
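
Substring matching on column names is prone to false positives: the indicator `cc`, for example, also matches `account_id`. One possible refinement, sketched here in plain Python, is to split the column name into tokens and compare whole tokens. `tokenize` and `classify_column_name` are hypothetical helpers, and the indicator lists are illustrative rather than complete.

```python
import re

# Illustrative subset of name indicators (assumption: adapt to your own configuration)
INDICATORS = {
    "EMAIL": ["email", "mail"],
    "CREDIT_CARD": ["cc", "credit", "card"],
    "PHONE_NUMBER": ["phone", "mobile", "tel"],
}


def tokenize(column_name):
    """Split snake_case, kebab-case, and camelCase names into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", column_name)
    return [token.lower() for token in re.split(r"[^A-Za-z0-9]+", spaced) if token]


def classify_column_name(column_name, indicators=INDICATORS):
    """Return the first PII type whose indicator equals a whole token, else None."""
    tokens = set(tokenize(column_name))
    for pii_type, words in indicators.items():
        if tokens.intersection(words):
            return pii_type
    return None


for name in ["customer_email", "account_id", "ccNumber", "homePhone"]:
    print(name, "->", classify_column_name(name))
```

Whole-token matching trades some recall for precision: a column named `emailaddr` would no longer match, so the indicator lists may need extra entries.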

#### Bulk Tagging Using Configuration Files

This example shows how to apply tags to multiple objects using a YAML configuration file:

In [0]:
import yaml
from pyspark.sql import SparkSession

# Parameters - set these variables at the top of your notebook cell
CONFIG_FILE_PATH = "tags_config.yaml"

# Sample YAML configuration content - can be used to create a config file
SAMPLE_CONFIG = """
catalogs:
  - name: main
    tags:
      Environment: Production
      Department: Engineering
      
schemas:
  - catalog: main
    name: sales
    tags:
      DataDomain: Revenue
      Owner: Finance
      
tables:
  - catalog: main
    schema: sales
    name: transactions
    tags:
      UpdateFrequency: Daily
      Retention: 5Years
      BusinessCritical: True
    columns:
      - name: customer_id
        tags:
          PII_TYPE: IDENTIFIER
          JOIN_KEY: True
      - name: email
        tags:
          PII_TYPE: EMAIL
          SENSITIVITY: HIGH
"""

# Helper functions for tagging different objects
def tag_catalog(catalog_name, tags_dict):
    """
    Apply tags to a catalog in Unity Catalog.
    
    Parameters:
    - catalog_name: Name of the catalog
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # SQL command to set tags
    sql_command = f"ALTER CATALOG {catalog_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to catalog {catalog_name}")

def tag_schema(catalog_name, schema_name, tags_dict):
    """
    Apply tags to a schema in Unity Catalog.
    
    Parameters:
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # Full schema reference
    full_schema_name = f"{catalog_name}.{schema_name}"
    
    # SQL command to set tags
    sql_command = f"ALTER SCHEMA {full_schema_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to schema {full_schema_name}")

def tag_table(catalog_name, schema_name, table_name, tags_dict):
    """
    Apply tags to a table in Unity Catalog.
    
    Parameters:
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema
    - table_name: Name of the table
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # Full table reference
    full_table_name = f"{catalog_name}.{schema_name}.{table_name}"
    
    # SQL command to set tags
    sql_command = f"ALTER TABLE {full_table_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to {full_table_name}")

def tag_table_column(catalog_name, schema_name, table_name, column_name, tags_dict):
    """
    Apply tags to a specific column in a table.
    
    Parameters:
    - catalog_name: Name of the catalog
    - schema_name: Name of the schema  
    - table_name: Name of the table
    - column_name: Name of the column to tag
    - tags_dict: Dictionary of tag key-value pairs
    """
    # Convert tags dictionary to SQL format
    tags_sql = ", ".join([f"'{k}' = '{v}'" for k, v in tags_dict.items()])
    
    # Full table reference
    full_table_name = f"{catalog_name}.{schema_name}.{table_name}"
    
    # SQL command to set column tags
    sql_command = f"ALTER TABLE {full_table_name} ALTER COLUMN {column_name} SET TAGS ({tags_sql})"
    
    # Execute the SQL command
    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql_command)
    
    print(f"Successfully applied tags to column {column_name} in {full_table_name}")

def apply_tags_from_config(config_file_path):
    """
    Apply tags to multiple objects from a YAML configuration file.
    
    Parameters:
    - config_file_path: Path to the YAML configuration file
    """
    # Load configuration
    with open(config_file_path, 'r') as file:
        config = yaml.safe_load(file)
    
    spark = SparkSession.builder.getOrCreate()
    
    # Process each object type
    for object_type, objects in config.items():
        if object_type == 'catalogs':
            for catalog in objects:
                name = catalog['name']
                tags = catalog.get('tags', {})
                if tags:
                    tag_catalog(name, tags)
        
        elif object_type == 'schemas':
            for schema in objects:
                catalog = schema['catalog']
                name = schema['name']
                tags = schema.get('tags', {})
                if tags:
                    tag_schema(catalog, name, tags)
        
        elif object_type == 'tables':
            for table in objects:
                catalog = table['catalog']
                schema = table['schema']
                name = table['name']
                tags = table.get('tags', {})
                if tags:
                    tag_table(catalog, schema, name, tags)
                
                # Process column tags if present
                columns = table.get('columns', [])
                for column in columns:
                    column_name = column['name']
                    column_tags = column.get('tags', {})
                    if column_tags:
                        tag_table_column(catalog, schema, name, column_name, column_tags)

# Create a sample config file (optional)
def create_sample_config(file_path):
    """Create a sample YAML configuration file"""
    with open(file_path, 'w') as file:
        file.write(SAMPLE_CONFIG)
    print(f"Sample configuration file created at {file_path}")

# Create the sample config file (comment this out if you already have your own)
create_sample_config(CONFIG_FILE_PATH)

# Call the function with the parameter defined at the top
apply_tags_from_config(config_file_path=CONFIG_FILE_PATH)
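
A malformed configuration file surfaces as a `KeyError` partway through the run, after some tags have already been applied. A minimal sketch of validating the parsed structure up front follows, assuming the layout of the sample configuration above; `validate_tag_config` is a hypothetical helper that operates on the dictionary returned by `yaml.safe_load`.

```python
# Keys each entry must provide, per section of the configuration
REQUIRED_KEYS = {
    "catalogs": ("name",),
    "schemas": ("catalog", "name"),
    "tables": ("catalog", "schema", "name"),
}


def validate_tag_config(config):
    """Return a list of problems found in a parsed tagging configuration."""
    problems = []
    for section, entries in (config or {}).items():
        if section not in REQUIRED_KEYS:
            problems.append(f"unknown section '{section}'")
            continue
        for index, entry in enumerate(entries or []):
            for key in REQUIRED_KEYS[section]:
                if key not in entry:
                    problems.append(f"{section}[{index}] is missing '{key}'")
            # Only table entries carry nested column definitions
            columns = (entry.get("columns") or []) if section == "tables" else []
            for column in columns:
                if "name" not in column:
                    problems.append(f"{section}[{index}] has a column without 'name'")
    return problems


# Example: the table entry lacks 'schema' and its column lacks 'name'
config = {
    "catalogs": [{"name": "main", "tags": {"Environment": "Production"}}],
    "tables": [{"catalog": "main", "name": "transactions", "columns": [{"tags": {}}]}],
}
print(validate_tag_config(config))
```

Calling this before `apply_tags_from_config` and stopping when the list is non-empty avoids leaving objects partially tagged.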

## Serverless Compute Workload Tagging

To attribute serverless compute usage, Databricks uses serverless budget policies. This feature is in Public Preview and allows you to tag serverless notebooks, jobs, pipelines, and model serving endpoints.

Since this requires administrative setup through the Databricks account console, we'll describe the general approach:

- Administrator creates serverless budget policies with custom tags
- Users or user groups are assigned to these policies
- Any serverless usage by these users is automatically tagged with the policy's custom tags

While this can't be directly implemented with Python code, you can query the tagged usage data:

In [0]:
from pyspark.sql import SparkSession

# Parameters - set these variables at the top of your notebook cell
START_DATE = "2025-01-01"  # Format: YYYY-MM-DD
END_DATE = "2025-01-31"    # Format: YYYY-MM-DD
FILTER_TAGS = {
    "role": "data_science",
    "req": "4356",
}  # Set to None to query all serverless usage

# Catalog, schema, and table for your sample data 
CATALOG_NAME = "tagging_test"
SCHEMA_NAME = "tagging_tables"
TABLE_NAME = "diamonds"

def query_tagged_serverless_usage(start_date, end_date, filter_tags=None):
    """
    Query serverless usage data with specific tags.
    
    Parameters:
    - start_date: Start date for usage query (YYYY-MM-DD)
    - end_date: End date for usage query (YYYY-MM-DD)
    - filter_tags: Dictionary of tags to filter by (optional)
    """
    spark = SparkSession.builder.getOrCreate()
    
    # Build query for the billable usage system table
    query = f"""
    SELECT 
        workspace_id,
        record_id,
        usage_date,
        sku_name,
        usage_type,
        usage_quantity,
        custom_tags
    FROM 
        system.billing.usage
    WHERE 
        usage_date BETWEEN '{start_date}' AND '{end_date}'
        AND usage_type = 'COMPUTE_TIME'
        AND product_features.is_serverless
"""
    
    # Add tag filters if provided
    if filter_tags:
        for key, value in filter_tags.items():
            query += f" AND custom_tags['{key}'] = '{value}'"
    
    # Execute query
    usage_data = spark.sql(query)
    
    return usage_data

# Function to view sample data
def view_sample_data():
    """View the sample diamonds data to verify it exists"""
    spark = SparkSession.builder.getOrCreate()
    full_table_name = f"{CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}"
    
    try:
        df = spark.table(full_table_name)
        print(f"Sample data from {full_table_name}:")
        df.show(5)
        return df
    except Exception as e:
        print(f"Error accessing table {full_table_name}: {str(e)}")
        return None

# Call the function with the parameters defined at the top
usage_data = query_tagged_serverless_usage(
    start_date=START_DATE,
    end_date=END_DATE,
    filter_tags=FILTER_TAGS
)

# Uncomment to display results
# display(usage_data)

# Uncomment to view your sample diamonds data
# view_sample_data()
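
Once the usage rows are available, a common next step is to total usage per tag value for chargeback. The sketch below does this in plain Python, assuming each row has been converted to a dictionary with `usage_quantity` and `custom_tags` keys (for example via `row.asDict()` on collected rows). `summarize_usage_by_tag` is a hypothetical helper; for large result sets the same aggregation is better expressed in SQL.

```python
from collections import defaultdict


def summarize_usage_by_tag(rows, tag_key, untagged_label="<untagged>"):
    """Sum usage_quantity per value of tag_key across an iterable of row dicts."""
    totals = defaultdict(int)
    for row in rows:
        tags = row.get("custom_tags") or {}
        totals[tags.get(tag_key, untagged_label)] += row["usage_quantity"]
    return dict(totals)


# Made-up rows standing in for collected usage records
sample_rows = [
    {"usage_quantity": 10, "custom_tags": {"project": "dipt"}},
    {"usage_quantity": 5, "custom_tags": {"project": "dipt", "env": "prod"}},
    {"usage_quantity": 3, "custom_tags": {}},
    {"usage_quantity": 2, "custom_tags": None},
]
print(summarize_usage_by_tag(sample_rows, "project"))
```

Rows without the requested tag are grouped under a separate label, which makes gaps in tagging coverage visible in the totals.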