<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Data Lineage with lakeFS

**Use Case**: Understand data transformations by using commits with metadata and "Blame" functionality

In this example, two datasets (employees and salaries) are ingested through two separate branches, merged together on a transformation branch, and finally promoted to the production branch.

At the very end of the process, the lakeFS "Blame" functionality (`log_commits`) is used to trace the origin of a specific file or dataset.

![](./images/data-lineage/CommitFlow.png)

## Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

### lakeFS repository name

In [None]:
repo_name = "data-lineage"

### Versioning Information

In [None]:
productionBranch = "main"
ingestionBranch1 = "ingest1"
ingestionBranch2 = "ingest2"
transformationBranch = "transformation"
newPath = "partitioned_data"
fileName = "Employees.csv"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
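try:
    v = lakefs.client.Client().version
except Exception as e:
    # Catch Exception (not a bare except) so that interrupting the kernel still works
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")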

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=productionBranch, exist_ok=True)
print(repo)

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

---

# Main demo starts here 🚦 👇🏻

## Ingest data into the first ingestion branch

In [None]:
branchIngest1 = repo.branch(ingestionBranch1).create(source_reference=productionBranch, exist_ok=True)
print(f"{ingestionBranch1} ref:", branchIngest1.get_commit().id)

In [None]:
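# Read the file as bytes so the payload matches the binary upload mode,
# and use a context manager so the file handle is closed.
with open(f"/data/{fileName}", 'rb') as f:
    contentToUpload = f.read()
branchIngest1.object(fileName).upload(data=contentToUpload, mode='wb')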

## Commit changes to first ingest branch and attach some metadata

In [None]:
ref = branchIngest1.commit(message='Ingesting employee IDs',
        metadata={'using': 'python_api',
                  '::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
                  'source': 'Employees.csv'})
print_commit(ref.get_commit())
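
The same metadata keys recur on the commits in this notebook. As a minimal sketch, a small helper could build that dictionary in one place. The name `build_commit_metadata`, its parameters, and the placeholder URL are hypothetical and not part of the lakefs SDK; the keys are the ones used in the cell above.

```python
from typing import Dict, Optional

# Key copied from the commit cell above; it carries a link to the code version.
CODE_URL_KEY = "::lakefs::codeVersion::url[url:ui]"


def build_commit_metadata(code_url: str, source: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper: assemble the metadata dict passed to commit()."""
    metadata = {"using": "python_api", CODE_URL_KEY: code_url}
    if source is not None:
        # Only record a source when the commit ingests a specific file
        metadata["source"] = source
    return metadata


# Placeholder URL for illustration only
meta = build_commit_metadata("https://example.com/notebook.ipynb", source="Employees.csv")
print(meta)
```

The resulting dictionary could then be passed as the `metadata` argument of `commit(...)`, which keeps the keys consistent across the ingestion and transformation commits.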

## Ingest data into the second ingestion branch

In [None]:
branchIngest2 = repo.branch(ingestionBranch2).create(source_reference=productionBranch, exist_ok=True)
print(f"{ingestionBranch2} ref:", branchIngest2.get_commit().id)

In [None]:
fileName = "Salaries.csv"

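# Read the file as bytes so the payload matches the binary upload mode,
# and use a context manager so the file handle is closed.
with open(f"/data/{fileName}", 'rb') as f:
    contentToUpload = f.read()
branchIngest2.object(fileName).upload(data=contentToUpload, mode='wb')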

## Commit changes to second ingest branch with metadata

In [None]:
ref = branchIngest2.commit(message='Ingesting Salaries',
        metadata={'using': 'python_api',
                  '::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
                  'source': 'Salaries.csv'})
print_commit(ref.get_commit())

## Merge the lists in a transformation branch

In [None]:
branchTransformation = repo.branch(transformationBranch).create(source_reference=productionBranch, exist_ok=True)
print(f"{transformationBranch} ref:", branchTransformation.get_commit().id)

In [None]:
res = branchIngest1.merge_into(branchTransformation)
print(res)

In [None]:
res = branchIngest2.merge_into(branchTransformation)
print(res)

In [None]:
employeeFile="Employees.csv"
SalariesFile="Salaries.csv"

In [None]:
dataPath = f"s3a://{repo_name}/{transformationBranch}/{employeeFile}"

df1 = spark.read.option("header", "true").csv(dataPath)
df1.show()

In [None]:
dataPath = f"s3a://{repo_name}/{transformationBranch}/{SalariesFile}"

df2 = spark.read.option("header", "true").csv(dataPath)
df2.show()

In [None]:
mergedDataset = df1.join(df2,["id"])
mergedDataset.show()
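
`df1.join(df2, ["id"])` performs an inner join by default, so only IDs present in both files survive. The plain-Python sketch below mimics that behavior on made-up rows (the names and values are illustrative, not taken from the CSV files) to show which records are dropped. It assumes each `id` appears at most once on the right-hand side, which Spark itself does not require.

```python
def inner_join(left, right, key):
    """Join two lists of dicts on `key`, keeping only keys present in both."""
    # Index the right side by key; assumes keys are unique on the right
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Made-up sample rows, for illustration only
employees = [
    {"id": "1", "name": "Ada"},
    {"id": "2", "name": "Lin"},
    {"id": "3", "name": "Sam"},
]
salaries = [
    {"id": "1", "salary": "100"},
    {"id": "3", "salary": "120"},
    {"id": "4", "salary": "90"},
]

merged = inner_join(employees, salaries, "id")
for row in merged:
    print(row)
```

In this sample, id 2 has no salary and id 4 has no employee record, so neither appears in the result.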

## Partition by department

In [None]:
newDataPath = f"s3a://{repo_name}/{transformationBranch}/{newPath}"

mergedDataset.write.partitionBy("department").csv(newDataPath)
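
`partitionBy("department")` writes one sub-directory per distinct value, named `department=<value>`, and leaves that column out of the files inside it. The sketch below reproduces this layout in plain Python on made-up rows; the function name `partition_layout` is hypothetical. It illustrates why a prefix such as `partitioned_data/department=Engineering/` selects the data of a single department.

```python
from collections import defaultdict


def partition_layout(rows, column, base_path):
    """Group rows under Hive-style `column=value/` directory prefixes."""
    layout = defaultdict(list)
    for row in rows:
        directory = f"{base_path}/{column}={row[column]}/"
        # The partition column is encoded in the path, not stored in the file
        layout[directory].append({k: v for k, v in row.items() if k != column})
    return dict(layout)


# Made-up sample rows, for illustration only
rows = [
    {"id": "1", "department": "Engineering", "salary": "100"},
    {"id": "2", "department": "Finance", "salary": "110"},
    {"id": "3", "department": "Engineering", "salary": "120"},
]

layout = partition_layout(rows, "department", "partitioned_data")
for directory in sorted(layout):
    print(directory, layout[directory])
```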

## Commit with metadata

In [None]:
ref = branchTransformation.commit(message='Repartitioned by departments',
        metadata={'using': 'python_api',
                  '::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb'})
print_commit(ref.get_commit())

## Atomically promote data to Production

In [None]:
branchProduction = repo.branch(productionBranch)
res = branchTransformation.merge_into(branchProduction)
print(res)

## Where did a dataset come from?

In [None]:
for log in lakefs.Reference(repository_id=repo_name, reference_id=productionBranch).log(max_amount=1, limit=True, prefixes=['partitioned_data/department=Engineering/']):
    print_commit(log)

In [None]:
for log in lakefs.Reference(repository_id=repo_name, reference_id=productionBranch).log(max_amount=1, objects=['Employees.csv']):
    print_commit(log)
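
The two cells above ask lakeFS for the most recent commit that touched a prefix or an object. Conceptually, this amounts to walking the history from newest to oldest and stopping at the first commit that changed a matching path. The sketch below models that idea with a toy, linear history; `ToyCommit`, `blame`, the commit IDs, and the part-file name are all hypothetical and are not part of the lakefs SDK. Real lakeFS history also contains merge commits, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyCommit:
    """Hypothetical stand-in for a lakeFS commit."""
    id: str
    message: str
    changed_paths: List[str]


def blame(history: List[ToyCommit], prefix: str) -> Optional[ToyCommit]:
    """Return the newest commit that changed a path starting with `prefix`.

    `history` must be ordered newest first.
    """
    for commit in history:
        if any(path.startswith(prefix) for path in commit.changed_paths):
            return commit
    return None


# Toy history, newest first; IDs and the part-file name are made up
history = [
    ToyCommit("c3", "Repartitioned by departments",
              ["partitioned_data/department=Engineering/part-0.csv"]),
    ToyCommit("c2", "Ingesting Salaries", ["Salaries.csv"]),
    ToyCommit("c1", "Ingesting employee IDs", ["Employees.csv"]),
]

print(blame(history, "partitioned_data/department=Engineering/"))
print(blame(history, "Employees.csv"))
```

Combined with the metadata attached at commit time, the commit found this way identifies both the step that produced a dataset and the code version that ran it.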

----

In [None]:
# The section below will only work on lakeFS Cloud or lakeFS Enterprise. 
# This cell will stop execution which is useful if the notebook has been 
# run from the top or is being run as part of automated testing.
import sys
print("ending notebook execution")
sys.exit(0)

----

# Auditing (lakeFS Enterprise and lakeFS Cloud only)

## Setup

In [None]:
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient
from lakefs_sdk.models import GroupCreation, UserCreation

In [None]:
if 'lakefsClient' not in locals():
    configuration = lakefs_sdk.Configuration(
        host=lakefsEndPoint,
        username=lakefsAccessKey,
        password=lakefsSecretKey,
    )
    lakefsClient = LakeFSClient(configuration)
    print("Created lakeFS client.")

### Creating an Engineering group

In [None]:
EngineeringGroup = lakefsClient.auth_api.create_group(
    group_creation=GroupCreation(
        id='Engineering'))

### Creating an engineer1 user

In [None]:
lakefsClient.auth_api.create_user(
    user_creation=UserCreation(
        id='engineer1'))

### Adding the engineer1 user to the group

In [None]:
lakefsClient.auth_api.add_group_membership(
    group_id=EngineeringGroup.id,
    user_id='engineer1')

## Generating credentials and setting up a client for the engineer1 user

In [None]:
credentials = lakefsClient.auth_api.create_credentials(user_id='engineer1')
print(credentials)
engineer1AccessKey = credentials.access_key_id
engineer1SecretKey = credentials.secret_access_key

In [None]:
configuration = lakefs_sdk.Configuration(
    host=lakefsEndPoint,
    username=engineer1AccessKey,
    password=engineer1SecretKey,
)
engineer1Client = LakeFSClient(configuration)
print("Created lakeFS client for engineer1.")

## Providing Engineers with Full Access to the Filesystem

In [None]:
lakefsClient.auth_api.attach_policy_to_group(
    group_id=EngineeringGroup.id,
    policy_id='FSFullAccess')

## The engineer1 user will now read the Finance department's salaries

In [None]:
engineer1Client.objects_api.list_objects(
    repository=repo_name,
    ref='main',
    prefix='partitioned_data/department=Finance/'
)