
# 5. Data Governance

## Notebook Description

This notebook demonstrates the data governance capabilities of Unity Catalog in Databricks. It covers access control with `GRANT` and `REVOKE`, row- and column-level masking, data lineage, and documenting tables and columns with comments (including AI-generated ones) to improve data discoverability and compliance.

Sections:
- Access Control
- Column Masking
- Row Filter
- Data Lineage
- AI-generated comments

In [None]:
CATALOG = 'workspace'
BRONZE_SCHEMA = 'bronze'
SILVER_SCHEMA = 'silver'
GOLD_SCHEMA   = 'gold'

GROUP_NAME = 'analysts'


## Step 1. Access control

In the Lakehouse, you can use standard SQL `GRANT` and `REVOKE` statements to define granular access control at the catalog, schema, and table levels, irrespective of the data source or format.

To run this access control demo, you first need to create a group. Click your avatar at the top-right corner of the Databricks UI, then navigate to "Settings" > "Identity and access" > "Groups", click "Manage", click "Add group" > "Add new", and enter a name for the group.

*Note: this access control demo will not work in Databricks Free Edition, because Free Edition does not let you create groups.*
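
As a minimal sketch of how the statements in this step are put together, the helper below builds `GRANT` and `REVOKE` strings for a table. The function name `build_privilege_sql` and the allowed-privilege set are assumptions made for illustration, not part of any Databricks API; the resulting string would still be passed to `spark.sql`.

```python
# Hypothetical helper (not a Databricks API): builds GRANT/REVOKE statements as strings.
ALLOWED_PRIVILEGES = {"SELECT", "MODIFY", "ALL PRIVILEGES"}  # assumed subset for this demo


def build_privilege_sql(action, privileges, table, principal):
    """Return a GRANT or REVOKE statement for a fully qualified table name."""
    action = action.upper()
    if action not in ("GRANT", "REVOKE"):
        raise ValueError(f"Unsupported action: {action}")
    privileges = [p.upper() for p in privileges]
    unknown = [p for p in privileges if p not in ALLOWED_PRIVILEGES]
    if unknown:
        raise ValueError(f"Unsupported privileges: {unknown}")
    if "`" in principal:
        raise ValueError("Principal names must not contain backticks")
    # GRANT ... TO <principal>, REVOKE ... FROM <principal>
    keyword = "TO" if action == "GRANT" else "FROM"
    return f"{action} {', '.join(privileges)} ON TABLE {table} {keyword} `{principal}`"


grant_sql = build_privilege_sql("grant", ["select"], "workspace.gold.applicants", "analysts")
revoke_sql = build_privilege_sql(
    "revoke", ["select", "modify"], "workspace.gold.applicants", "analysts"
)
print(grant_sql)
print(revoke_sql)
```

Validating the privilege list up front keeps a typo from reaching the SQL engine as a half-formed statement.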


In [0]:
# Grant the analysts group SELECT permission on the table.
# Note: make sure you created the `analysts` group first.
display(spark.sql(f"GRANT SELECT ON TABLE {CATALOG}.{GOLD_SCHEMA}.applicants TO `{GROUP_NAME}`"))

# Grant SELECT and MODIFY. This demo reuses the same group; in practice MODIFY
# would typically go to a separate data engineering group.
display(spark.sql(f"GRANT SELECT, MODIFY ON TABLE {CATALOG}.{GOLD_SCHEMA}.applicants TO `{GROUP_NAME}`"))

In [0]:
# Check the permissions granted on the table; the analysts group should be listed
display(spark.sql(f"SHOW GRANTS ON TABLE {CATALOG}.{GOLD_SCHEMA}.applicants"))

In [0]:
# Now let's try removing grants
display(spark.sql(f"REVOKE SELECT, MODIFY ON TABLE {CATALOG}.{GOLD_SCHEMA}.applicants FROM `{GROUP_NAME}`"))

In [0]:
# Check that `analysts` group does not have any permissions anymore
display(spark.sql(f"SHOW GRANTS ON TABLE {CATALOG}.{GOLD_SCHEMA}.applicants"))


## Row and Column-level masking

There are two ways to implement masking: dynamic views and row filter / column mask functions.

**Dynamic view masking** uses SQL logic (e.g., CASE statements) within views to redact or transform column values based on user/group membership. It is managed at the view level and is flexible for simple access scenarios.

**Row filter / column mask functions** use SQL user-defined functions (UDFs) to filter rows or to mask columns that contain sensitive information. You can combine them with Unity Catalog attribute-based access control (ABAC) policies to manage column masks centrally using governed tags. Policies are applied at the catalog, schema, or table level, enabling scalable, consistent, and auditable masking across multiple tables based on user attributes and data classification.

**Summary:** Dynamic views suit custom, table-specific masking, while row filters and column masks combined with ABAC policies provide centralized, tag-driven governance for sensitive data masking.
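
To make the two ideas concrete, here is a pure-Python analogy of what a row filter and a column mask do to a query result. It involves no Spark or Unity Catalog; the sample rows, the `region` column, and the function names are made up for illustration only.

```python
# Pure-Python analogy of row filters and column masks (no Spark involved).
# The sample rows and the `region` column are invented for this sketch.
rows = [
    {"applicant_id": 1, "ktp_number": "0000000000000001", "region": "west"},
    {"applicant_id": 2, "ktp_number": "0000000000000002", "region": "east"},
]


def column_mask(value, is_privileged):
    """Return the raw value for privileged users, a fixed mask otherwise."""
    return value if is_privileged else "******"


def row_filter(row, is_privileged):
    """Privileged users see every row; others see only the 'west' region."""
    return is_privileged or row["region"] == "west"


def query(rows, is_privileged):
    """Drop rows the user may not see, then mask the sensitive column."""
    return [
        {**row, "ktp_number": column_mask(row["ktp_number"], is_privileged)}
        for row in rows
        if row_filter(row, is_privileged)
    ]


restricted = query(rows, is_privileged=False)
full = query(rows, is_privileged=True)
print(restricted)
print(full)
```

The row filter decides which rows are returned at all, while the column mask changes the value shown for a column in the rows that remain.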


## Step 2. Row/Column-level masking using Dynamic View

The cells below demonstrate how to handle sensitive data through column masking using a dynamic view.

Table: `{CATALOG}.{SILVER_SCHEMA}.applicants`

In [0]:
# Create a secured view over applicants that masks the KTP number
# for members of the `users` group; everyone else sees the raw value
display(spark.sql(f"""
CREATE OR REPLACE VIEW {CATALOG}.{SILVER_SCHEMA}.applicants_secured AS
SELECT
  c.* EXCEPT (ktp_number),
  CASE
    WHEN is_member('users')
      THEN '*********'
    ELSE CAST(c.ktp_number AS STRING)
  END AS ktp_number_protected
FROM
  {CATALOG}.{SILVER_SCHEMA}.applicants AS c;
"""))

In [0]:
# Check if the view really masks KTP number
display(spark.sql(f"""
SELECT
  applicant_id,
  nama_lengkap,
  jenis_kelamin,
  ktp_number_protected
FROM {CATALOG}.{SILVER_SCHEMA}.applicants_secured
LIMIT 5;
"""))


## Step 3. Row/Column-level masking using UDF Functions

The cells below demonstrate how to handle sensitive data through column masking using a SQL UDF.

_Note that this can be combined with an ABAC policy to mask based on certain attributes, but that is out of scope for this notebook._

Table: `{CATALOG}.{SILVER_SCHEMA}.applicants`

In [0]:
# Create the column mask function: members of the `users` account group see
# the raw KTP number; everyone else sees a fixed mask
display(spark.sql(f"""
CREATE OR REPLACE FUNCTION ktp_mask(ktp_number STRING)
  RETURN 
    CASE 
      WHEN is_account_group_member('users') THEN ktp_number 
      ELSE '******' 
    END;
"""))

In [0]:
# Apply column mask functions to table
display(spark.sql(f"""
ALTER TABLE {CATALOG}.{SILVER_SCHEMA}.applicants ALTER COLUMN ktp_number SET MASK ktp_mask;
"""))

In [0]:
# Check that querying the table now returns the masked KTP number
display(spark.sql(f"""
SELECT
  applicant_id,
  nama_lengkap,
  jenis_kelamin,
  ktp_number
FROM {CATALOG}.{SILVER_SCHEMA}.applicants
LIMIT 5;
"""))

In [0]:
# Clean up
display(spark.sql(f"ALTER TABLE {CATALOG}.{SILVER_SCHEMA}.applicants ALTER COLUMN ktp_number DROP MASK;"))
display(spark.sql(f"DROP FUNCTION ktp_mask;"))
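
This step applies only a column mask. A row filter follows the same pattern: create a boolean SQL function, then attach it with `ALTER TABLE ... SET ROW FILTER ... ON (...)`. The sketch below only builds the statements as strings. The helper `build_row_filter_sql`, the function name `gender_filter`, and the visible value `'L'` are hypothetical placeholders, and filtering on `jenis_kelamin` is chosen only because that column appears in the queries above.

```python
# Hypothetical helper (not a Databricks API): returns row filter statements as strings.
def build_row_filter_sql(table, function_name, column, group, visible_value):
    """Build the CREATE FUNCTION, SET ROW FILTER, and DROP ROW FILTER statements."""
    # Keep the sketch simple: reject characters that would need SQL escaping.
    for text in (group, visible_value):
        if "'" in text or "\\" in text:
            raise ValueError(f"Unsupported character in literal: {text!r}")
    create_fn = (
        f"CREATE OR REPLACE FUNCTION {function_name}({column} STRING) "
        f"RETURN is_account_group_member('{group}') OR {column} = '{visible_value}'"
    )
    apply_filter = f"ALTER TABLE {table} SET ROW FILTER {function_name} ON ({column})"
    drop_filter = f"ALTER TABLE {table} DROP ROW FILTER"
    return create_fn, apply_filter, drop_filter


create_fn, apply_filter, drop_filter = build_row_filter_sql(
    table="workspace.silver.applicants",
    function_name="gender_filter",  # hypothetical function name
    column="jenis_kelamin",
    group="users",
    visible_value="L",  # assumed placeholder value
)
for statement in (create_fn, apply_filter, drop_filter):
    print(statement)
```

With this filter in place, members of the group would see every row and everyone else only the rows matching the visible value; the last statement is the clean-up counterpart of `DROP MASK`.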


## Step 4. Data Lineage

Lineage is critical for compliance, auditing, and observability, as well as for data discoverability.

These are three very common scenarios where full data lineage becomes especially important:
1. **Explainability** - we need the means to trace features used in machine learning back to the raw data that created those features,
2. **Missing values** - tracing missing values in a dashboard or ML model back to their origin,
3. **Finding specific data** - organizations have hundreds or even thousands of data tables and sources. Finding the table or column that contains specific information can be daunting without proper discoverability tools.

**Note**: To explore the lineage, navigate to the Catalog, find the `{CATALOG}.{GOLD_SCHEMA}.applicants` table inside your catalog and schema, then click the `Lineage` tab.
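
Tracing a value back to its origin amounts to walking a lineage graph upstream. The sketch below does this with a breadth-first search over a hand-written dictionary. The edges are hypothetical and only mirror the bronze/silver/gold naming used in this notebook; the actual lineage is captured by Unity Catalog and shown in the `Lineage` tab.

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the tables it is built from.
UPSTREAM = {
    "workspace.gold.applicants": ["workspace.silver.applicants"],
    "workspace.silver.applicants": ["workspace.bronze.applicants"],
    "workspace.gold.loans": ["workspace.silver.loans", "workspace.silver.applicants"],
    "workspace.silver.loans": ["workspace.bronze.loans"],
}


def trace_upstream(table, upstream=UPSTREAM):
    """Return every table that feeds `table`, nearest sources first (breadth-first)."""
    seen, order = {table}, []
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for source in upstream.get(current, []):
            if source not in seen:  # skip tables reached through another path
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


loan_sources = trace_upstream("workspace.gold.loans")
print(loan_sources)
```

Breadth-first order lists the closest sources first, which is usually where to start looking when a dashboard value goes missing.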

## Step 5. AI-Generated Comments for Data Governance

Comments are useful for several purposes:
* For Data Discovery: users can find tables and columns by description
* For Data Dictionary: column and table meanings are clearly described, providing context for users and AI alike
* For Governance: provides context for data lineage and documentation audits
* For Documentation: promotes consistent semantics and meaning and helps avoid duplicate data creation, which improves data quality

Besides using SQL commands, we can also generate comments and descriptions from Catalog Explorer.

**Table:** `{CATALOG}.{GOLD_SCHEMA}.loans`
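
When many columns need documenting, the statements can be generated from a mapping instead of being written one by one. The following is a minimal sketch: the helper `build_column_comment_sql` is hypothetical, the description for `applicant_id` is an invented example, and the escaping assumes Spark SQL's default backslash escapes in string literals.

```python
# Hypothetical helper (not a Databricks API): one ALTER TABLE statement per column.
def build_column_comment_sql(table, descriptions):
    """Turn a {column: description} mapping into column COMMENT statements."""
    statements = []
    for column, text in descriptions.items():
        # Assumption: backslash escaping of backslashes and single quotes.
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        statements.append(
            f"ALTER TABLE {table} ALTER COLUMN {column} COMMENT '{escaped}'"
        )
    return statements


statements = build_column_comment_sql(
    "workspace.gold.loans",
    {
        "loan_id": "Unique identifier for each loan record",
        "applicant_id": "Applicant's identifier, joins to the applicants table",
    },
)
for statement in statements:
    print(statement)
```

Each resulting string could then be passed to `spark.sql`, in the same way as the single-column statement in the next cell.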


In [0]:
# Add a descriptive comment to a column
# Comments help users understand the data without reading separate documentation

display(spark.sql(f"""
ALTER TABLE {CATALOG}.{GOLD_SCHEMA}.loans ALTER COLUMN loan_id 
COMMENT 'Unique identifier for each loan record';
"""))

# Add comprehensive table description
display(spark.sql(f"""
COMMENT ON TABLE {CATALOG}.{GOLD_SCHEMA}.loans IS 
'Gold-layer table containing loan transaction records with terms, amounts, and default status. 
Used for credit risk analysis and loan portfolio management. 
Joins with applicants table via applicant_id.';
"""))

In [0]:
# Query to see all column comments
display(spark.sql(f"""
SELECT 
  column_name,
  data_type,
  comment
FROM {CATALOG}.information_schema.columns
WHERE 
  table_catalog = '{CATALOG}'
  AND table_schema = '{GOLD_SCHEMA}'
  AND table_name = 'loans'
ORDER BY ordinal_position
"""))