<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Data Lineage with lakeFS

**Use Case**: Understand data transformations by using commits with metadata and "Blame" functionality

In this example, two datasets (employees and salaries) are ingested through two separate branches, merged together on a transformation branch, and finally promoted to the production branch.

At the very end of the process, the lakeFS "Blame" functionality (`log_commits`) is used to trace the origin of a specific file or dataset.

![](./images/data-lineage/CommitFlow.png)

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFODNN7EXAMPLE'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
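
The values above are hard-coded for the bundled sample environment. Outside that environment you may prefer to read them from environment variables instead of editing the notebook. The following is a minimal sketch, assuming the hypothetical variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY` and `LAKEFS_SECRET_KEY` (these names are illustrative and are not defined by this notebook):

```python
import os

def load_lakefs_settings(env=os.environ):
    """Return endpoint and credentials, falling back to the notebook's sample values.

    The environment variable names are hypothetical; rename them to suit your setup.
    """
    return {
        "endpoint": env.get("LAKEFS_ENDPOINT", "http://lakefs:8000"),
        "access_key": env.get("LAKEFS_ACCESS_KEY", "AKIAIOSFODNN7EXAMPLE"),
        "secret_key": env.get("LAKEFS_SECRET_KEY", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"),
    }

settings = load_lakefs_settings()
lakefsEndPoint = settings["endpoint"]
lakefsAccessKey = settings["access_key"]
lakefsSecretKey = settings["secret_key"]
```

With no variables set, the three names keep the same values as the cell above, so the rest of the notebook runs unchanged.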

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(You shouldn't need to change anything in this section; just run it.)**

In [None]:
repo_name = "data-lineage"

### Create lakeFSClient

In [None]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
try:
    v=lakefs.config.get_lake_fs_version()
except Exception as e:
    # Report the underlying error instead of silently swallowing it
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v.version}")

### Define lakeFS Repository

In [None]:
import os  # needed for os._exit() in the error handlers below
from lakefs_client.exceptions import NotFoundException

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException as f:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo=lakefs.repositories.create_repository(repository_creation=RepositoryCreation(name=repo_name,
                                                                                                storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(00)
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(00)

### Set up Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lakeFS / Jupyter") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark

## Versioning Information

In [None]:
productionBranch = "main"
ingestionBranch1 = "ingest1"
ingestionBranch2 = "ingest2"
transformationBranch = "transformation"
newPath = "partitioned_data"
fileName = "Employees.csv"

---

# Main demo starts here 🚦 👇🏻

## Ingest data into the first ingestion branch

In [None]:
lakefs.branches.create_branch(
    repository=repo.id,
    branch_creation=BranchCreation(
        name=ingestionBranch1,
        source=productionBranch))

In [None]:
import os
contentToUpload = open(f"/data/{fileName}", 'rb') # Only a single file per upload, which must be named "content"
lakefs.objects.upload_object(
    repository=repo.id,
    branch=ingestionBranch1,
    path=fileName, content=contentToUpload)

## Commit changes to first ingest branch and attach some metadata

In [None]:
lakefs.commits.commit(
    repository=repo.id,
    branch=ingestionBranch1,
    commit_creation=CommitCreation(
        message='Ingesting employees IDs',
        metadata={'using': 'python_api',
                  '::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
                  'source': 'Employees.csv'}))
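
The same metadata shape is attached to every commit in this notebook. As a sketch only, a small helper (hypothetical, not part of `lakefs_client`) can build that dictionary so the keys stay consistent across commits. The code-version key is copied from the call above; the example URL is a placeholder.

```python
# Key copied verbatim from the commit call above
CODE_VERSION_KEY = "::lakefs::codeVersion::url[url:ui]"

def build_commit_metadata(source=None, code_url=None, using="python_api"):
    """Build a string-to-string metadata dict for a commit.

    Hypothetical helper: optional entries are only added when a value is given.
    """
    metadata = {"using": using}
    if code_url:
        metadata[CODE_VERSION_KEY] = code_url
    if source:
        metadata["source"] = source
    return metadata

# Placeholder URL used purely for illustration
example = build_commit_metadata(source="Employees.csv",
                                code_url="https://example.invalid/notebook.ipynb")
print(example)
```

The result could then be passed as `metadata=` to `CommitCreation` in place of the inline dictionary.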

## Ingest data into the second ingestion branch

In [None]:
lakefs.branches.create_branch(
    repository=repo.id,
    branch_creation=BranchCreation(
        name=ingestionBranch2,
        source=productionBranch))

In [None]:
fileName = "Salaries.csv"

import os
contentToUpload = open(f"/data/{fileName}", 'rb') # Only a single file per upload, which must be named "content"
lakefs.objects.upload_object(
    repository=repo.id,
    branch=ingestionBranch2,
    path=fileName, content=contentToUpload)

## Commit changes to second ingest branch with metadata

In [None]:
lakefs.commits.commit(
    repository=repo.id,
    branch=ingestionBranch2,
    commit_creation=CommitCreation(
        message='Ingesting Salaries',
        metadata={'using': 'python_api',
                  '::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb',
                  'source': 'Salaries.csv'}))

## Merge the lists in a transformation branch

In [None]:
lakefs.branches.create_branch(
    repository=repo.id,
    branch_creation=BranchCreation(
        name=transformationBranch,
        source=productionBranch))

In [None]:
lakefs.refs.merge_into_branch(
    repository=repo.id,
    source_ref=ingestionBranch1, 
    destination_branch=transformationBranch)

In [None]:
lakefs.refs.merge_into_branch(
    repository=repo.id,
    source_ref=ingestionBranch2, 
    destination_branch=transformationBranch)

In [None]:
employeeFile="Employees.csv"
SalariesFile="Salaries.csv"

In [None]:
dataPath = f"s3a://{repo.id}/{transformationBranch}/{employeeFile}"

df1 = spark.read.option("header", "true").csv(dataPath)
df1.show()
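
The `s3a://` URIs used in these cells address lakeFS through its S3-compatible endpoint: the repository takes the place of the bucket, and the first path segment is the branch (or another ref). A small helper, hypothetical and not part of the notebook, makes that layout explicit:

```python
def lakefs_s3a_uri(repository, ref, path=""):
    """Build a URI of the form s3a://<repository>/<ref>/<path>."""
    return f"s3a://{repository}/{ref}/{path.lstrip('/')}"

# Same location as dataPath in the cell above, using the notebook's default names
print(lakefs_s3a_uri("data-lineage", "transformation", "Employees.csv"))
```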


In [None]:
dataPath = f"s3a://{repo.id}/{transformationBranch}/{SalariesFile}"

df2 = spark.read.option("header", "true").csv(dataPath)
df2.show()

In [None]:
mergedDataset = df1.join(df2,["id"])
mergedDataset.show()
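
`df1.join(df2, ["id"])` performs an inner join, which is Spark's default: only `id` values present in both files survive, and the `id` column appears once in the result. The plain-Python sketch below illustrates that behaviour on a few invented rows. The `name` and `salary` columns and all values are assumptions for illustration and are not taken from the actual CSV files.

```python
# Invented sample rows; only the "id" and "department" column names come from the notebook
employees = [
    {"id": "1", "name": "Ada", "department": "Engineering"},
    {"id": "2", "name": "Grace", "department": "Finance"},
    {"id": "3", "name": "Alan", "department": "Engineering"},
]
salaries = [
    {"id": "1", "salary": "100"},
    {"id": "2", "salary": "120"},
    {"id": "4", "salary": "90"},
]

def inner_join_on_id(left, right):
    """Return rows whose "id" appears on both sides, merging their columns."""
    right_by_id = {}
    for row in right:
        right_by_id.setdefault(row["id"], []).append(row)
    joined = []
    for left_row in left:
        for right_row in right_by_id.get(left_row["id"], []):
            joined.append({**left_row, **right_row})
    return joined

merged_rows = inner_join_on_id(employees, salaries)
for row in merged_rows:
    print(row)
```

Rows with `id` 3 (no salary) and `id` 4 (no employee) are dropped, which is worth keeping in mind when row counts differ between the inputs and the merged dataset.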

## Partition by department

In [None]:
newDataPath = f"s3a://{repo.id}/{transformationBranch}/{newPath}"

mergedDataset.write.partitionBy("department").csv(newDataPath)
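
`partitionBy("department")` writes one directory per distinct department value, named `department=<value>`, and leaves the partition column out of the data files themselves. This is why the lineage query later in the notebook uses the prefix `partitioned_data/department=Engineering/`. The sketch below mimics that layout in plain Python; the rows and the helper name are invented for illustration.

```python
from collections import defaultdict

def partition_layout(rows, column, base):
    """Group rows under Hive-style '<base>/<column>=<value>/' prefixes.

    The partition column is removed from each row, mirroring how the value
    is encoded in the directory name rather than in the file contents.
    """
    layout = defaultdict(list)
    for row in rows:
        remaining = {k: v for k, v in row.items() if k != column}
        layout[f"{base}/{column}={row[column]}/"].append(remaining)
    return dict(layout)

# Invented sample rows
sample_rows = [
    {"id": "1", "department": "Engineering"},
    {"id": "2", "department": "Finance"},
    {"id": "3", "department": "Engineering"},
]
layout = partition_layout(sample_rows, "department", "partitioned_data")
for prefix, rows in sorted(layout.items()):
    print(prefix, rows)
```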

## Commit with metadata

In [None]:
lakefs.commits.commit(
    repository=repo.id,
    branch=transformationBranch,
    commit_creation=CommitCreation(
        message='Repartitioned by departments',
        metadata={'using': 'python_api',
                  '::lakefs::codeVersion::url[url:ui]': 'https://github.com/treeverse/lakeFS-samples/blob/668c7d000b8c603b3f30789a8c10616086ef79c1/08-data-lineage/Data%20Lineage.ipynb'}))

## Atomically promote data to Production

In [None]:
lakefs.refs.merge_into_branch(
    repository=repo.id,
    source_ref=transformationBranch, 
    destination_branch=productionBranch)

## Where did a dataset come from?

In [None]:
commits = lakefs.refs.log_commits(repository=repo.id, ref='main', amount=1, limit=True, prefixes=['partitioned_data/department=Engineering/'])
print(commits.results)

In [None]:
commits = lakefs.refs.log_commits(repository=repo.id, ref='main', amount=1, objects=['Employees.csv'])
print(commits.results)
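
Printing `commits.results` dumps the full commit records. For lineage purposes the interesting parts are usually the commit ID, the committer, the message and the metadata attached earlier. The helper below is a sketch that assumes each entry exposes `id`, `committer`, `message` and `metadata` attributes; it is demonstrated on a stand-in object with invented values, not on real lakeFS output.

```python
from types import SimpleNamespace

def summarize_commits(results):
    """Reduce commit records to the fields most useful for tracing lineage."""
    summary = []
    for commit in results:
        summary.append({
            "id": commit.id,
            "committer": commit.committer,
            "message": commit.message,
            "metadata": dict(commit.metadata or {}),
        })
    return summary

# Stand-in shaped like a commit record; every value here is invented
fake_results = [
    SimpleNamespace(
        id="abc123",
        committer="demo-user",
        message="Repartitioned by departments",
        metadata={"using": "python_api"},
    )
]
commit_summary = summarize_commits(fake_results)
for entry in commit_summary:
    print(entry["id"], entry["committer"], entry["message"], entry["metadata"])
```

In the notebook, the same function could be called as `summarize_commits(commits.results)`.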


----


In [None]:
os.environ

In [None]:
# The section below will only work on lakeFS Cloud.
# This cell will stop execution which is useful if the notebook has been 
# run from the top or is being run as part of automated testing.
import sys
print("ending notebook execution")
sys.exit(0)

----

# Auditing (lakeFS Cloud only)

## Setup

### Creating an Engineering group

In [None]:
lakefs.auth.create_group(
    group_creation=GroupCreation(
        id='Engineering'))

### Creating an engineer1 User

In [None]:
lakefs.auth.create_user(
    user_creation=UserCreation(
        id='engineer1'))

### Adding the engineer1 User to the group

In [None]:
lakefs.auth.add_group_membership(
    group_id='Engineering',
    user_id='engineer1')

## Generating credentials and setting up a client for the engineer1 User

In [None]:
credentials = lakefs.auth.create_credentials(user_id='engineer1')
print(credentials)
engineer1AccessKey = credentials.access_key_id
engineer1SecretKey = credentials.secret_access_key

In [None]:
# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = engineer1AccessKey
configuration.password = engineer1SecretKey
configuration.host = lakefsEndPoint

# Creating a client for engineer1
engineer1Client = LakeFSClient(configuration)
print("Created lakeFS client for engineer1.")

## Providing Engineers with Full Access to the Filesystem

In [None]:
lakefs.auth.attach_policy_to_group(
    group_id='Engineering',
    policy_id='FSFullAccess')

## engineer1 lists the Finance department's salary data

In [None]:
engineer1Client.objects.list_objects(
    repository=repo.id,
    ref='main',
    prefix='partitioned_data/department=Finance/'
)