# **PHASE 2: DATA ENGINEERING (Days 5-8)**

## **DAY 8 (16/01/26) - Unity Catalog Governance**



### **Section 1 - Learn**:

### **_1. Catalog → Schema → Table hierarchy_**
In Databricks **Unity Catalog**, data is organized using a **three-tier namespace**. This structure allows you to govern and find your data across the entire organization rather than just within a single workspace.

##### **The 3-Tier Hierarchy**

1. **Catalog (Tier 1):** The highest level of isolation.
   * Think of this as a "Data Domain" (e.g., `Sales`, `Marketing`, or `Dev`).
   * Permissions are often managed at this level (e.g., "The Finance team owns the `Finance` catalog").

2. **Schema / Database (Tier 2):** A grouping within a catalog.
   * This is synonymous with a "Database" in traditional SQL.
   * In a Medallion architecture, you might have schemas named `bronze`, `silver`, and `gold` inside a single catalog.

3. **Table / View (Tier 3):** The actual data object.
   * This is where your rows and columns live. It can be a physical **Table** (managed or external) or a virtual **View**.



##### **How to Query Data**

When working in a notebook, you reference data using "Dot Notation":
`SELECT * FROM catalog_name.schema_name.table_name`

##### **Hierarchy Components & Governance**

| Level | Role | Example |
| --- | --- | --- |
| **Catalog** | Organizational Unit / Environment | `production`, `sandbox` |
| **Schema** | Functional Grouping | `raw_billing`, `refined_users` |
| **Table** | Specific Dataset | `invoice_details`, `active_subscriptions` |


##### **Key Benefits of this Hierarchy**

* **Centralized Governance:** Privileges are inherited downward: a privilege such as `SELECT` granted on a **Catalog** automatically applies to every current and future **Schema** and **Table** within it (Inheritance).
* **Searchability:** The Databricks **Catalog Explorer** allows you to search for data across the entire 3-tier namespace from any workspace attached to the same metastore.
* **Data Lineage:** Unity Catalog tracks how data flows from a table in the `bronze` schema to a table in the `gold` schema, providing a visual map of your data's journey.
* **Standardization:** It forces teams to stop using "magic paths" (like `s3://my-bucket/data/file.parquet`) and instead use clean, human-readable SQL names.


##### **Pro-Tip: Setting the Context**

To avoid typing the full 3-part name every time, you can set your session context:

```sql
USE CATALOG production;
USE SCHEMA gold;
SELECT * FROM sales_report; -- Automatically looks in production.gold

```
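The way the session context fills in missing name parts can be sketched in plain Python. This is only a toy model of the resolution rule described above, not Databricks code: the function `resolve_table_name` is hypothetical, and it assumes identifiers contain no dots or backtick quoting.

```python
def resolve_table_name(name: str, current_catalog: str, current_schema: str) -> str:
    """Expand a 1-, 2-, or 3-part table name using the session context.

    Toy model only: assumes identifiers never contain dots or backticks.
    """
    parts = name.split(".")
    if len(parts) == 1:
        # table -> current_catalog.current_schema.table
        return f"{current_catalog}.{current_schema}.{parts[0]}"
    if len(parts) == 2:
        # schema.table -> current_catalog.schema.table
        return f"{current_catalog}.{parts[0]}.{parts[1]}"
    if len(parts) == 3:
        # Already fully qualified; the session context is ignored.
        return name
    raise ValueError(f"Expected at most 3 name parts, got {len(parts)}: {name!r}")


# Mirrors the session above: USE CATALOG production; USE SCHEMA gold;
print(resolve_table_name("sales_report", "production", "gold"))
# production.gold.sales_report
print(resolve_table_name("silver.orders", "production", "gold"))
# production.silver.orders
```

A fully qualified name always wins over the session context, which is why shared or scheduled code is usually safer with three-part names.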

---

### **_2. Access control (GRANT/REVOKE)_**

In Databricks **Unity Catalog**, access control follows a standard SQL syntax. Because of the **3-tier hierarchy**, permissions are hierarchical: a privilege granted on a Catalog is automatically inherited by all Schemas and Tables within it.

##### **1. The Hierarchy of Privileges**

Access "trickles down" from the top:

* **Account Admin:** Manages account-level resources such as metastores, workspaces, users, and groups.
* **Catalog Owner:** Owns a specific catalog, can create schemas in it, and can grant privileges on it.
* **Schema Owner:** Can create tables and views within that schema.
* **Table Owner:** Has full control over a specific table.

##### **2. Common SQL Commands**

To give a user or group permission to see data, you typically need to grant two kinds of privilege: **USE CATALOG** / **USE SCHEMA** (the ability to "enter" the containers; formerly called `USAGE`) and **SELECT** (the ability to "read" the data).

##### **Granting Read Access**

```sql
-- 1. Allow the group to "enter" the catalog (formerly USAGE)
GRANT USE CATALOG ON CATALOG main TO `data_analysts`;

-- 2. Allow the group to "enter" the schema (formerly USAGE)
GRANT USE SCHEMA ON SCHEMA main.marketing TO `data_analysts`;

-- 3. Allow the group to "read" the specific table
GRANT SELECT ON TABLE main.marketing.customer_leads TO `data_analysts`;

```

##### **Granting Write/Modify Access**

If a Data Engineer needs to update a table, you would grant more powerful permissions:

```sql
GRANT MODIFY, SELECT ON TABLE main.marketing.customer_leads TO `data_engineers`;

```

##### **Revoking Access**

If a team no longer needs access, the syntax is straightforward:

```sql
REVOKE SELECT ON TABLE main.marketing.customer_leads FROM `data_analysts`;

```


##### **3. Key Privilege Types**

| Privilege | What it allows |
| --- | --- |
| **`USE CATALOG`** / **`USE SCHEMA`** (formerly `USAGE`) | Required to access a Catalog or Schema and the objects inside it. Does not by itself give access to data. |
| **`SELECT`** | Allows reading data from a Table or View. |
| **`MODIFY`** | Allows `INSERT`, `UPDATE`, `DELETE`, and `MERGE` operations. |
| **`CREATE SCHEMA`** / **`CREATE TABLE`** (formerly `CREATE`) | Allows a user to create new Schemas (granted on a Catalog) or Tables (granted on a Schema). |
| **`ALL PRIVILEGES`** | Grants every privilege applicable to that specific object and its children. It does not transfer ownership. |
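
The combination of container privileges, `SELECT`, and downward inheritance can be sketched as a small Python model. This is a toy illustration under stated assumptions, not how Databricks evaluates permissions: the helpers `ancestors`, `has_privilege`, and `can_select` are hypothetical, ownership and `ALL PRIVILEGES` are ignored, and the container privileges use the Unity Catalog names `USE CATALOG` / `USE SCHEMA` (formerly `USAGE`).

```python
# A grant is a (principal, privilege, securable) triple, e.g.
# ("data_analysts", "SELECT", "main.marketing.customer_leads").
Grant = tuple[str, str, str]


def ancestors(securable: str) -> list[str]:
    """Return the securable followed by its parents: c.s.t -> [c.s.t, c.s, c]."""
    parts = securable.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def has_privilege(grants: set[Grant], principal: str, privilege: str, securable: str) -> bool:
    """True if the privilege was granted on the securable or on any parent (inheritance)."""
    return any((principal, privilege, obj) in grants for obj in ancestors(securable))


def can_select(grants: set[Grant], principal: str, table: str) -> bool:
    """Reading a table needs USE CATALOG, USE SCHEMA, and SELECT together."""
    catalog, schema, _ = table.split(".")
    return (
        has_privilege(grants, principal, "USE CATALOG", catalog)
        and has_privilege(grants, principal, "USE SCHEMA", f"{catalog}.{schema}")
        and has_privilege(grants, principal, "SELECT", table)
    )


grants: set[Grant] = {
    ("data_analysts", "USE CATALOG", "main"),
    ("data_analysts", "USE SCHEMA", "main.marketing"),
    ("data_analysts", "SELECT", "main.marketing.customer_leads"),
}
table = "main.marketing.customer_leads"
print(can_select(grants, "data_analysts", table))  # True

# Equivalent of REVOKE SELECT ... FROM `data_analysts`
grants.discard(("data_analysts", "SELECT", table))
print(can_select(grants, "data_analysts", table))  # False
```

The model also shows why `SELECT` alone is not enough: without the two container privileges the check fails even though the table-level grant exists.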


##### **4. Best Practices for Access Control**

* **Grant to Groups, Not Users:** Always create groups (e.g., `finance_team`, `data_scientists`) in the Databricks Admin Console and grant permissions to the group. This makes it easier to manage when people join or leave the company.
* **The Principle of Least Privilege:** Start by granting only `USE CATALOG`, `USE SCHEMA`, and `SELECT`. Only grant `MODIFY` to automated Service Principals or Data Engineers.
* **Use Views for Masking:** If a table contains sensitive data (like SSNs), don't grant access to the table. Instead, create a **View** that hides the sensitive columns and grant `SELECT` on that View instead.
* **Check Permissions:** You can always see who has access to an object by running:
`SHOW GRANTS ON TABLE main.marketing.customer_leads;`



---

### **_3. Data lineage_**
In Databricks **Unity Catalog**, **Data Lineage** is a built-in feature that automatically tracks the flow of data from its source to its ultimate destination. It provides a visual map of how tables are related, helping you understand the "ancestry" of your data.

##### **1. How It Works**

Unlike manual documentation, Unity Catalog captures lineage **automatically** at runtime. Whenever you run a Spark query or a SQL command that reads from one table and writes to another, Databricks records that relationship.

* **Table-Level Lineage:** Shows how tables depend on each other (e.g., `Bronze_Orders` flows into `Silver_Orders`).
* **Column-Level Lineage:** Shows exactly which source columns were used to calculate a specific output column (e.g., `Revenue` in Gold is calculated from `Price` and `Quantity` in Silver).
* **Notebook & Workflow Lineage:** Identifies which specific notebook or job was responsible for the data movement.

##### **2. Why Data Lineage Matters**

* **Impact Analysis:** If you plan to change a column name in a "Silver" table, you can look downstream to see exactly which "Gold" dashboards or ML models will break.
* **Trust & Audit:** If an executive questions a number on a dashboard, a Data Engineer can trace it back through the layers to the original raw source file to verify its accuracy.
* **Compliance (GDPR/CCPA):** If you need to track where a user's PII (Personally Identifiable Information) is stored, lineage helps you find every table that has touched that sensitive data.
* **Debugging:** When a pipeline fails, lineage helps you quickly identify the "upstream" parent table that might contain the bad data causing the error.

##### **3. Key Visual Components**

In the **Catalog Explorer**, the Lineage tab provides three views:

1. **Upstream:** What tables/files did this data come from?
2. **Downstream:** What tables/dashboards/models use this data?
3. **Lineage Graph:** A full interactive map that you can expand to see the entire "life" of the data across catalogs and schemas.
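
The upstream and downstream views are, conceptually, two traversals of the same directed graph. The sketch below illustrates that idea in plain Python with a breadth-first search. It is not the Unity Catalog implementation, and the `LINEAGE` edges and table names are hypothetical examples.

```python
from collections import deque

# Hypothetical table-level lineage: each source table maps to the tables built from it.
LINEAGE = {
    "ecommerce.bronze.orders": ["ecommerce.silver.orders"],
    "ecommerce.silver.orders": ["ecommerce.gold.products"],
    "ecommerce.gold.products": ["ecommerce.gold.top_products"],
}


def downstream(edges: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first walk: every table that directly or indirectly reads from `start`."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(edges: dict[str, list[str]], start: str) -> list[str]:
    """Same walk over reversed edges: every table that `start` depends on."""
    reversed_edges: dict[str, list[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            reversed_edges.setdefault(target, []).append(source)
    return downstream(reversed_edges, start)


# Impact analysis: what could break if silver.orders changes?
print(downstream(LINEAGE, "ecommerce.silver.orders"))
# ['ecommerce.gold.products', 'ecommerce.gold.top_products']

# Root-cause tracing: where does gold.top_products get its data?
print(upstream(LINEAGE, "ecommerce.gold.top_products"))
# ['ecommerce.gold.products', 'ecommerce.silver.orders', 'ecommerce.bronze.orders']
```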

##### **4. Limitations to Keep in Mind**

* **Unity Catalog Requirement:** Lineage only works for tables and views registered within Unity Catalog. It cannot track "legacy" Hive Metastore tables or raw files outside of UC management.
* **Read-Only via UI:** The UI only displays lineage; you cannot edit it there. To build your own custom reporting on data flow, query the **System Tables** (e.g., `system.access.table_lineage`) instead.
* **Real-time vs. Batch:** Lineage is captured when a workload actually runs, so the graph only reflects queries and jobs that have executed. A pipeline that is defined but has never run does not appear.

##### **Example Use Case**

Imagine a `total_spend` column in your **Gold** table is suddenly showing negative values. By opening the **Lineage Graph**, you can trace the column back to a `discount_applied` field in the **Silver** layer, and eventually to a raw JSON field in **Bronze**, helping you find the exact point where the calculation logic or source data went wrong.

---

### **_4. Managed vs external tables_**

In Databricks Unity Catalog, the distinction between **Managed** and **External** tables defines who controls the data's lifecycle and where that data physically lives on your cloud storage (S3, ADLS, or GCS).

##### **1. Managed Tables**

Managed tables are the "default" way to create tables in Unity Catalog. Databricks handles both the **metadata** (the table definition) and the **physical data files**.

* **Storage Location:** Data is stored in a central "Managed Storage" location defined at the Metastore, Catalog, or Schema level.
* **Lifecycle Control:** If you run `DROP TABLE`, Databricks removes **both** the table definition in the catalog and the physical data files from the cloud storage. The files are cleaned up after a retention period rather than instantly.
* **Ease of Use:** You don't need to provide a path. You simply write `CREATE TABLE name AS SELECT...` and Databricks handles the rest.
* **Best For:** Most standard Medallion architecture layers (Silver and Gold) where Databricks is the primary tool interacting with the data.


##### **2. External Tables**

External tables (also called "Unmanaged tables") allow you to point Databricks at data that already exists in a specific cloud storage location.

* **Storage Location:** You must provide a specific path (e.g., `LOCATION 's3://my-bucket/data/orders'`). In Unity Catalog, that path must fall under a registered external location.
* **Lifecycle Control:** If you run `DROP TABLE`, Databricks **only** deletes the metadata (the table name in the catalog). The **physical files stay exactly where they are**.
* **Sharing:** These are ideal if other tools (like Snowflake, Presto, or a legacy app) need to read the same Parquet/Delta files.
* **Best For:** The **Bronze** layer, where you are ingesting raw data from external systems, or when you need to maintain control over the folder structure for non-Databricks tools.


##### **Key Comparison Table**

| Feature | Managed Table | External Table |
| --- | --- | --- |
| **`DROP TABLE` behavior** | Deletes Metadata + Physical Data | Deletes Metadata ONLY |
| **Location** | Fixed (Managed Storage) | Flexible (User-defined `LOCATION`) |
| **Data Governance** | Full Unity Catalog control | Shared with external systems |
| **Syntax** | `CREATE TABLE table_name...` | `CREATE TABLE table_name... LOCATION 'path'` |
| **Performance** | Optimized by Databricks | Varies based on storage layout |
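
The `DROP TABLE` row is the difference that matters most in practice, and it can be made concrete with a toy Python model. Everything here is hypothetical (the `ToyCatalog` class, the `managed://` path scheme), and it simplifies real behavior by deleting managed files immediately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableEntry:
    name: str
    location: str
    managed: bool


class ToyCatalog:
    """Toy model: `tables` is the metadata, `storage` is the set of paths holding files."""

    def __init__(self) -> None:
        self.tables: dict[str, TableEntry] = {}
        self.storage: set[str] = set()

    def create_table(self, name: str, location: Optional[str] = None) -> None:
        # No LOCATION given -> managed table; the catalog picks the path itself.
        managed = location is None
        path = location if location is not None else f"managed://{name}"
        self.tables[name] = TableEntry(name, path, managed)
        self.storage.add(path)

    def drop_table(self, name: str) -> None:
        entry = self.tables.pop(name)  # the metadata is always removed
        if entry.managed:
            self.storage.discard(entry.location)  # managed: files go too
        # external: the files are left untouched


catalog = ToyCatalog()
catalog.create_table("ecommerce.gold.products")  # managed
catalog.create_table("ecommerce.bronze.orders", location="s3://my-bucket/data/orders")  # external

catalog.drop_table("ecommerce.gold.products")
catalog.drop_table("ecommerce.bronze.orders")

print(catalog.tables)           # {}
print(sorted(catalog.storage))  # ['s3://my-bucket/data/orders']
```

After both drops the catalog is empty, but the external path still holds its files, so the table could be registered again by pointing a new `CREATE TABLE ... LOCATION` at the same path.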


##### **How to choose?**

1. **Use Managed Tables** whenever possible. They are cleaner to manage, and Unity Catalog can perform better optimizations when it has full control over the file layout.
2. **Use External Tables** only if you have a specific requirement to keep the files in a particular folder for other applications, or if you are "registering" data that was written by a system outside of Databricks.

---

### **Practice**

In [0]:
%sql
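-- Create the catalog and one schema per Medallion layer (safe to re-run)
CREATE CATALOG IF NOT EXISTS ecommerce;
USE CATALOG ecommerce;
CREATE SCHEMA IF NOT EXISTS bronze;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;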

---

In [0]:
%sql
-- Permissions
GRANT SELECT ON TABLE gold.products TO `analysts@company.com`;
GRANT ALL PRIVILEGES ON SCHEMA silver TO `engineers@company.com`;

Output: the cell raised an `AnalysisException` (command `5531138585135986`, line 1). The traceback is truncated in this export, so the error message itself is not shown.

---

In [0]:
%sql
-- Controlled view
CREATE VIEW gold.top_products AS
SELECT product_name, revenue, conversion_rate
FROM gold.products
WHERE purchases > 10
ORDER BY revenue DESC LIMIT 100;

---

### **Resources**
- [Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/)
- [Data governance](https://docs.databricks.com/data-governance/unity-catalog/get-started.html)

----