# Show Lineage for Delta Tables in Unity Catalog

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/uc/lineage/uc-lineage-slide.png?raw=true" style="float:right; margin-left:10px" width="700"/>

Unity Catalog captures runtime data lineage for any table-to-table operation executed on a Databricks cluster or SQL endpoint. Lineage works across all languages (SQL, Python, Scala, and R). It can be visualized in the Data Explorer in near real time and retrieved via the REST API.

Lineage is available at two granularity levels:
- Tables
- Columns: ideal to track GDPR dependencies

Lineage takes into account the Table ACLs present in Unity Catalog. If a user is not allowed to see a table at a certain point of the graph, its information is redacted, but the user can still see that an upstream or downstream table is present.
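The redaction behaviour described above can be pictured with a small sketch. This is only an illustration in plain Python, not how Unity Catalog implements it: the edge list is written by hand from the tables used later in this demo, and `redact_lineage` and the `<redacted-N>` placeholder format are hypothetical names.

```python
# Hand-written lineage edges for this demo: (upstream, downstream)
EDGES = [
    ("main.lineage.menu", "main.lineage.dinner"),
    ("main.lineage.dinner", "main.lineage.dinner_price"),
    ("main.lineage.price", "main.lineage.dinner_price"),
]


def redact_lineage(edges, visible_tables):
    """Return the edges with tables the user cannot see replaced by a placeholder."""
    placeholders = {}

    def mask(table):
        if table in visible_tables:
            return table
        # Each hidden table gets a stable placeholder so the graph shape survives
        if table not in placeholders:
            placeholders[table] = f"<redacted-{len(placeholders) + 1}>"
        return placeholders[table]

    return [(mask(up), mask(down)) for up, down in edges]


# A user who can only see `dinner` and `dinner_price`
redacted = redact_lineage(
    EDGES, visible_tables={"main.lineage.dinner", "main.lineage.dinner_price"}
)
for up, down in redacted:
    print(f"{up} -> {down}")
```

The user still sees that `dinner` and `dinner_price` each have an upstream table, but not which one.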

## Working with Lineage

No modifications to existing code are needed to generate lineage. As long as you operate on tables saved in Unity Catalog, Databricks captures all lineage information for you.

Requirements:
- Make sure you set `spark.databricks.dataLineage.enabled true` in your cluster setup
- Source and target tables must be registered in a Unity Catalog metastore to be eligible for lineage capture
- The data manipulation must be performed through Spark (the DataFrame API or SQL)
- To view lineage, users must have the SELECT privilege on the table
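As a minimal sketch of the first requirement, the flag can be carried in the `spark_conf` section of a cluster definition. The cluster name below is a placeholder and `lineage_enabled` is a hypothetical helper; only the configuration key comes from the list above.

```python
import json

# Spark configuration key named in the requirements above
LINEAGE_CONF_KEY = "spark.databricks.dataLineage.enabled"

# Hypothetical cluster definition fragment; only `spark_conf` matters here
cluster_spec = {
    "cluster_name": "my-lineage-cluster",  # placeholder name
    "spark_conf": {LINEAGE_CONF_KEY: "true"},
}


def lineage_enabled(spec):
    """True when the cluster spec sets the lineage flag to "true"."""
    value = spec.get("spark_conf", {}).get(LINEAGE_CONF_KEY, "false")
    return value.lower() == "true"


print(json.dumps(cluster_spec, indent=2))
print(lineage_enabled(cluster_spec))
```

On a running cluster, `spark.conf.get("spark.databricks.dataLineage.enabled", "false")` should report the same setting.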


### A cluster has been created for this demo
To run this demo, just select the cluster `dbdemos-uc-03-data-lineage-maynard` from the dropdown menu ([open cluster configuration](https://adb-3759185753378633.13.azuredatabricks.net/#setting/clusters/0531-105738-55eb9ir4/configuration)). <br />
*Note: If the cluster was deleted after 30 days, you can re-create it with `dbdemos.create_cluster('uc-03-data-lineage')` or re-install the demo: `dbdemos.install('uc-03-data-lineage')`*

In [0]:
%run ./_resources/00-setup

## 1/ Create a Delta Table In Unity Catalog

The first step is to create a Delta Table in Unity Catalog.

We want to do that in SQL, to show multi-language support:

1. Use the `CREATE TABLE` command and define a schema
1. Use the `INSERT INTO` command to insert some rows in the table

In [0]:
%sql 
SELECT CURRENT_CATALOG()

In [0]:
%sql
-- Start from an empty table so the demo can be re-run
CREATE TABLE IF NOT EXISTS menu (recipe_id INT, app STRING, main STRING, desert STRING);
DELETE FROM menu;

INSERT INTO menu 
    (recipe_id, app, main, desert) 
VALUES 
    (1, "Ceviche", "Tacos", "Flan"),
    (2, "Tomato Soup", "Souffle", "Creme Brulee"),
    (3, "Chips", "Grilled Cheese", "Cheesecake");

## 2/ Create a Delta Table from the Previously Created One

To show dependencies between tables, we create a new table `AS SELECT` from the previous one, concatenating three columns into a new one.

In [0]:
%sql
CREATE TABLE IF NOT EXISTS dinner 
  AS SELECT recipe_id, concat(app," + ", main," + ",desert) as full_menu FROM menu

## 3/ Create a Delta Table as a Join of Two Other Tables

The last step is to create a table by joining `dinner` with a new `price` table. This time we use Python instead of SQL.

- We create a DataFrame with random data in two columns, `recipe_id` and `price`
- We save this DataFrame as a new table, `main.lineage.price`
- We read two tables as DataFrames, `main.lineage.dinner` and `main.lineage.price`
- We join them on `recipe_id` and save the result as a new Delta table, `main.lineage.dinner_price`

In [0]:
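from pyspark.sql import functions as F  # may already be imported by the setup notebook

# Random price per recipe; `id` is renamed so it can be joined on `recipe_id`
df = spark.range(3).withColumn("price", F.round(10*F.rand(seed=42),2)).withColumnRenamed("id", "recipe_id")

# Save the prices as a table so that lineage is captured
df.write.mode("overwrite").saveAsTable("price")

# Read both source tables back as DataFrames
dinner = spark.read.table("dinner")
price = spark.read.table("price")

# Join on recipe_id and save the result as a new table
dinner_price = dinner.join(price, on="recipe_id")
dinner_price.write.mode("overwrite").saveAsTable("dinner_price")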


## 4/ Visualize Table Lineage

Lineage can be visualized in the `Data Explorer`, in the part of the Workspace dedicated to the `SQL Persona`.

1. Select the `Catalog`
1. Select the `Schema`
1. Select the `Table`
1. Select the `Lineage` tab on the right part of the page
1. You can visualize the full lineage by pressing the `See Lineage Graph` button
1. By default the graph is condensed. By clicking on the boxes you can expand them and visualize the full lineage.


<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/uc/lineage/lineage-table.gif?raw=true"/>
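The introduction mentions that lineage can also be retrieved via REST API. The sketch below builds such a request and reads table names out of a response. Treat the details as assumptions to check against the lineage API reference for your workspace: the endpoint path, the query parameters, and the `upstreams` / `downstreams` / `tableInfo` response shape. The host, token, and sample response are placeholders, and the request is built but never sent.

```python
import urllib.parse
import urllib.request

# Assumed endpoint path; verify it against the lineage REST API reference
TABLE_LINEAGE_PATH = "/api/2.0/lineage-tracking/table-lineage"


def build_table_lineage_request(host, token, table_name):
    """Build (but do not send) a GET request for the lineage of one table."""
    query = urllib.parse.urlencode(
        {"table_name": table_name, "include_entity_lineage": "true"}
    )
    return urllib.request.Request(
        url=f"{host.rstrip('/')}{TABLE_LINEAGE_PATH}?{query}",
        method="GET",
        headers={"Authorization": f"Bearer {token}"},
    )


def qualified_names(lineage, direction):
    """Extract catalog.schema.table names from the assumed response shape."""
    names = []
    for entry in lineage.get(direction, []):
        info = entry.get("tableInfo")
        if info:  # skip entries that do not describe a table
            names.append(f'{info["catalog_name"]}.{info["schema_name"]}.{info["name"]}')
    return names


# Placeholder host and token
req = build_table_lineage_request(
    "https://example.cloud.databricks.com", "<token>", "main.lineage.dinner"
)
print(req.full_url)

# Hand-written sample in the assumed shape, standing in for a real response
sample = {
    "upstreams": [
        {"tableInfo": {"catalog_name": "main", "schema_name": "lineage", "name": "menu"}}
    ],
    "downstreams": [
        {"tableInfo": {"catalog_name": "main", "schema_name": "lineage", "name": "dinner_price"}}
    ],
}
print(qualified_names(sample, "upstreams"))
print(qualified_names(sample, "downstreams"))
```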

## 5/ Visualize Column Lineage

Lineage is also available at the column level. This is very useful to track column dependencies and to locate GDPR-relevant data, including through the API.

You can access the column lineage by clicking on any column name. In this case we see that the `full_menu` column comes from three columns of the `menu` table:
<br/><br/>


<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/uc/lineage/lineage-column.gif?raw=true"/>
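To make the column dependencies concrete, here is a small plain-Python sketch that traces a column back to its root sources. The mapping is written by hand from the statements in this notebook and covers only a subset of columns; it is not output from Unity Catalog, and `root_columns` is a hypothetical helper.

```python
# Column-level edges implied by this demo: target column -> direct source columns
COLUMN_SOURCES = {
    ("dinner", "recipe_id"): [("menu", "recipe_id")],
    ("dinner", "full_menu"): [("menu", "app"), ("menu", "main"), ("menu", "desert")],
    ("dinner_price", "full_menu"): [("dinner", "full_menu")],
    ("dinner_price", "price"): [("price", "price")],
}


def root_columns(column, sources=COLUMN_SOURCES):
    """Follow edges upstream until reaching columns with no recorded source."""
    direct = sources.get(column)
    if not direct:
        return {column}
    roots = set()
    for parent in direct:
        roots |= root_columns(parent, sources)
    return roots


# Which original columns feed dinner_price.full_menu?
print(sorted(root_columns(("dinner_price", "full_menu"))))
```

A trace like this is what makes it possible to answer questions such as "which downstream columns contain data derived from a personal-data column".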


## 6/ Lineage Permission Model

Lineage graphs share the same permission model as Unity Catalog. If a user does not have the SELECT privilege on the table, they will not be able to explore the lineage.

## Conclusion

Databricks Unity Catalog lets you track data lineage out of the box.

No extra setup is required: just read from and write to your tables and the engine builds the dependencies for you. Lineage works at the table level and also at the column level, which provides a powerful tool to track dependencies on sensitive data.

Lineage can also show you the potential impact of updating a table or column, and help you find who will be impacted downstream.
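The downstream impact idea can be sketched as a breadth-first walk over the lineage graph. The graph below is written by hand from the tables created in this demo, and `impacted_tables` is a hypothetical helper, not a Databricks API.

```python
from collections import deque

# Table-level lineage of this demo: table -> tables that read from it
DOWNSTREAM = {
    "menu": ["dinner"],
    "dinner": ["dinner_price"],
    "price": ["dinner_price"],
}


def impacted_tables(table, downstream=DOWNSTREAM):
    """List every table reachable downstream of `table`, nearest first."""
    seen, queue = [], deque([table])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


# Changing `menu` affects `dinner` directly and `dinner_price` transitively
print(impacted_tables("menu"))
```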


### Existing Limitations
- Streaming operations are not yet supported
- Lineage is not captured when data is written directly to files in cloud storage, even if a table is defined at that location (e.g. `df.write.save("s3://mybucket/mytable/")` will not produce lineage)
- Lineage is not captured across workspaces (e.g. if a table A > table B transformation is performed in workspace 1 and table B > table C in workspace 2, each workspace will show a partial view of the lineage for table B)
- Lineage is computed on a 90-day rolling window, meaning that lineage is not displayed for tables that have not been modified in the last 90 days