<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Using [Lua Hooks](https://docs.lakefs.io/howto/hooks/lua.html) in lakeFS (similar to GitHub Actions)

This notebook demonstrates how to create a pre-merge hook in lakeFS that validates metadata before data is merged into the production branch.

1. Define a hooks configuration file and Lua scripts for metadata validation.
2. Run an ETL process: create an ingestion branch, upload data files with metadata, and atomically promote the data to the production branch through a merge.
3. Watch the pre-merge hook block the promotion because of metadata issues, resulting in a Precondition Failed error.
4. Fix the metadata and promote the data to production again.

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment_**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "metadata-validation-example-repo"

### Versioning Information

In [None]:
mainBranch = "main"
ingestionBranch = "ingestion_branch"
fileName1 = "userdata1.parquet"
fileName2 = "userdata2.parquet"

### Import libraries

In [None]:
%xmode Minimal
import os
import lakefs
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient
from lakefs_sdk import models
from assets.lakefs_demo import print_commit, print_diff, lakefs_ui_endpoint
import yaml

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Verify lakeFS credentials by getting lakeFS version

In [None]:
print("Verifying lakeFS credentials…")
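try:
    v = lakefs.client.Client().version
except Exception as e:
    # Catch Exception rather than using a bare except, and show why the check failed
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")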

### Working with the lakeFS Python client API

In [None]:
configuration = lakefs_sdk.Configuration(
    host=lakefsEndPoint,
    username=lakefsAccessKey,
    password=lakefsSecretKey,
)
lakefsClient = LakeFSClient(configuration)

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=mainBranch, exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

---

# Main demo starts here 🚦 👇🏻

## Setup and Configure Hooks

### Configure hooks in the repository

* Upload the [hooks config YAML file](./hooks/pre-merge-metadata-validation.yaml) for metadata validation, which checks for mandatory metadata before data is merged into the main branch
* The hooks config file must be uploaded under the "_lakefs_actions" prefix

In [None]:
hooks_config_yaml = "pre-merge-metadata-validation.yaml"
hooks_prefix = "_lakefs_actions"

contentToUpload = open(f'./hooks/{hooks_config_yaml}', 'r').read()
print(branchMain.object(f'{hooks_prefix}/{hooks_config_yaml}').upload(data=contentToUpload, mode='wb', pre_sign=False))
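
The contents of the hooks config file are not shown in this notebook. As a rough illustration only, a lakeFS action file names the triggering event, the branches it applies to, and a list of hooks. The sketch below writes such a definition as a Python dict (the equivalent of the YAML) and lists which script runs for which event. The action name, hook ids, and `args` values are assumptions made for this example, not copied from `pre-merge-metadata-validation.yaml`.

```python
# Hypothetical action definition, expressed as the Python equivalent of the YAML file.
# The name, hook ids, and args are illustrative assumptions.
action = {
    "name": "pre merge metadata validation",
    "on": {"pre-merge": {"branches": ["main"]}},
    "hooks": [
        {
            "id": "check_commit_metadata",
            "type": "lua",
            "properties": {
                "script_path": "scripts/commit_metadata_validator.lua",
                # args are handed to the Lua script; these keys are assumed
                "args": {"spark_version": {"required": True}},
            },
        },
        {
            "id": "check_dataset_metadata",
            "type": "lua",
            "properties": {"script_path": "scripts/dataset_validator.lua"},
        },
    ],
}


def summarize_action(action):
    """Return one (event, branch, hook id, script path) tuple per hook."""
    rows = []
    for event, spec in action["on"].items():
        for branch in spec.get("branches", []):
            for hook in action["hooks"]:
                script = hook["properties"].get("script_path")
                rows.append((event, branch, hook["id"], script))
    return rows


for row in summarize_action(action):
    print(row)
```

The `script_path` values point at the `scripts/` prefix used for the Lua uploads in the following cells.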

### Upload 1st script

##### The script [commit_metadata_validator.lua](./hooks/commit_metadata_validator.lua) checks commit metadata to validate that mandatory metadata fields are present and that their values match the required patterns

In [None]:
lua_script_file_name = "commit_metadata_validator.lua"
lua_scripts_path = "scripts"

contentToUpload = open(f'./hooks/{lua_script_file_name}', 'r').read()
print(branchMain.object(f'{lua_scripts_path}/{lua_script_file_name}').upload(data=contentToUpload, mode='wb', pre_sign=False))
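
The Lua script itself is not reproduced here. The kind of check it is described as performing (mandatory fields present, values matching a pattern) can be sketched in Python as follows. The rule table is an assumption based on the two failures shown later in this notebook (`spark_version` missing, `notebook_url` not matching `github.com/.*`). Python's `re` syntax also differs from Lua patterns, so treat this as an approximation of the logic rather than a port of the script.

```python
import re

# Hypothetical rules, inferred from the merge failures demonstrated in this notebook.
RULES = {
    "notebook_url": {"required": True, "pattern": "github.com/.*"},
    "spark_version": {"required": True},
}


def validate_commit_metadata(metadata, rules=RULES):
    """Return a list of human-readable problems; an empty list means the metadata passes."""
    errors = []
    for key, rule in rules.items():
        value = metadata.get(key)
        if value is None:
            if rule.get("required"):
                errors.append(f"missing mandatory metadata field: {key}")
            continue
        pattern = rule.get("pattern")
        # re.search: the pattern may match anywhere in the value
        if pattern and not re.search(pattern, value):
            errors.append(f"metadata field '{key}' does not match pattern '{pattern}'")
    return errors


print(validate_commit_metadata({"notebook_url": "https://github.com/treeverse/lakeFS-samples"}))
```

A pre-merge hook built on a check like this would reject the merge whenever the returned list is non-empty.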

### Upload 2nd script

##### The script [dataset_validator.lua](./hooks/dataset_validator.lua) validates the existence of mandatory metadata describing a dataset

In [None]:
lua_script_file_name = "dataset_validator.lua"

contentToUpload = open(f'./hooks/{lua_script_file_name}', 'r').read()
print(branchMain.object(f'{lua_scripts_path}/{lua_script_file_name}').upload(data=contentToUpload, mode='wb', pre_sign=False))
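
To make the dataset checks concrete, here is a minimal Python sketch of the rules that the later steps of this notebook run into: a boolean `contains_pii`, a required `approval_link` that looks like an HTTP(S) URL, and a `department` limited to a fixed set. The schema layout, the `rank` rule, and the choice to stop at the first violation are assumptions for illustration. The actual logic lives in `dataset_validator.lua` and its hook arguments.

```python
import re

# Hypothetical schema, inferred from the failures demonstrated later in this notebook.
SCHEMA = {
    "contains_pii": {"required": True, "type": bool},
    "approval_link": {"required": True, "type": str, "pattern": r"https?://.*"},
    "department": {"required": True, "type": str, "choices": ["hr", "it", "other"]},
    "rank": {"required": False, "type": int},  # assumed optional
}


def first_violation(metadata, schema=SCHEMA):
    """Return the first rule violation as a string, or None if the metadata is valid."""
    for field, rule in schema.items():
        if field not in metadata:
            if rule.get("required"):
                return f"field '{field}' is required"
            continue
        value = metadata[field]
        expected = rule.get("type")
        # Exact type comparison so that True is not accepted where an int is expected
        if expected is not None and type(value) is not expected:
            return f"field '{field}' should be of type {expected.__name__}"
        pattern = rule.get("pattern")
        if pattern and not re.match(pattern, value):
            return f"field '{field}' should match the pattern '{pattern}'"
        choices = rule.get("choices")
        if choices and value not in choices:
            return f"field '{field}' should be one of {choices}"
    return None


print(first_violation({"contains_pii": "yes", "rank": 1, "department": "finance"}))
```

Feeding the successive versions of `dataset_metadata_definition` from the cells below into `first_violation` walks through the same sequence of problems, one at a time.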

### Commit changes to the lakeFS repo

In [None]:
ref = branchMain.commit(message='Added hooks config file and metadata validation scripts')
print_commit(ref.get_commit())

### Protect the main branch so that no one can write to it directly; any subsequent change must arrive through a merge from another branch

In [None]:
lakefsClient.repositories_api.set_branch_protection_rules(
    repository=repo_name,
    branch_protection_rule=[models.BranchProtectionRule(
        pattern=mainBranch)])

# ETL Job Starts

## Create a new branch which will be used to ingest data

In [None]:
branchIngestion = repo.branch(ingestionBranch).create(source_reference=mainBranch, exist_ok=True)
print(f"{ingestionBranch} ref:", branchIngestion.get_commit().id)

## Upload data files

In [None]:
obj = branchIngestion.object(path=f"datasets/{fileName1}")

with open(f"/data/userdata/{fileName1}", mode='rb') as reader, obj.writer(mode='wb') as writer:
    writer.write(reader.read())

In [None]:
obj = branchIngestion.object(path=f"datasets/{fileName2}")

with open(f"/data/userdata/{fileName2}", mode='rb') as reader, obj.writer(mode='wb') as writer:
    writer.write(reader.read())

## Upload metadata file

In [None]:
dataset_metadata_definition = {
   'contains_pii': 'yes',
   'rank': 1,
   'department': 'finance'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

## Commit changes

In [None]:
ref = branchIngestion.commit(message='Added data and metadata files')
print_commit(ref.get_commit())

## Promote the Data into production

#### Merging the ingestion branch with the current metadata to the production branch
#### 🛑🛑 Merge will fail because the 'spark_version' key is missing from the merge metadata. Review the error message.

In [None]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb'})
print(res)
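
Several of the merge cells in this section are expected to fail, which stops a "Run All" execution at the first error. One option is to wrap each attempt in a small helper that reports the failure instead of raising. The helper below is a hypothetical convenience, not part of the lakeFS SDK. It catches `Exception` broadly because the exact exception class raised by a rejected merge is not assumed here, and the stand-in function only simulates a rejection.

```python
def attempt(action, label="merge"):
    """Run action(); return (True, result) on success or (False, message) on failure."""
    try:
        return True, action()
    except Exception as exc:  # broad on purpose: the exact exception type is not assumed
        return False, f"{label} failed: {exc}"


# Stand-in for a merge that a pre-merge hook rejects (simulated, no server involved).
def failing_merge():
    raise RuntimeError("Precondition Failed: pre-merge hook aborted")


ok, outcome = attempt(failing_merge)
print(ok, outcome)
```

In this notebook it could be called as `attempt(lambda: branchIngestion.merge_into(branchMain, metadata={...}))`, leaving the original cells unchanged when the raw error output is preferred.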

#### Add 'spark_version' metadata and try to merge again.
#### 🛑🛑 Merge will fail again because metadata field 'notebook_url' does not match the pattern: 'github.com/.*'.

In [None]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.ai/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

#### Change 'github.ai' to 'github.com' in the value of 'notebook_url' metadata and try to merge again.
#### 🛑🛑 Merge will fail again because field 'contains_pii' in dataset_metadata.yaml file should be of type boolean.

In [None]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

#### Change the value of the field 'contains_pii' in the dataset_metadata.yaml file to the boolean True and try to merge again.
#### 🛑🛑 Merge will fail again because the field 'approval_link' is required in the dataset_metadata.yaml file.

In [None]:
dataset_metadata_definition = {
   'contains_pii': True,
   'rank': 1,
   'department': 'finance'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

ref = branchIngestion.commit(message='Changed metadata file')
print_commit(ref.get_commit())

In [None]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

#### Add field 'approval_link' in the dataset_metadata.yaml file and try to merge again.
#### 🛑🛑 Merge will fail again because value for field 'approval_link' should match the pattern 'https?:\\/\\/.*'.

In [None]:
dataset_metadata_definition = {
   'contains_pii': True,
   'approval_link': 'example.com',
   'rank': 1,
   'department': 'finance'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

ref = branchIngestion.commit(message='Changed metadata file')
print_commit(ref.get_commit())

In [None]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

#### Change value for the field 'approval_link' from 'example.com' to 'https://example.com' and try to merge again.
#### 🛑🛑 Merge will fail again because the value of the field 'department' should be one of 'hr', 'it', or 'other'.

In [None]:
dataset_metadata_definition = {
   'contains_pii': True,
   'approval_link': 'https://example.com',
   'rank': 1,
   'department': 'finance'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

ref = branchIngestion.commit(message='Changed metadata file')
print_commit(ref.get_commit())

In [None]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

#### Change value for the field 'department' from 'finance' to 'hr' and try to merge again.
#### Merge will succeed this time.

In [None]:
dataset_metadata_definition = {
   'contains_pii': True,
   'approval_link': 'https://example.com',
   'rank': 1,
   'department': 'hr'
}

with branchIngestion.object(path='datasets/dataset_metadata.yaml').writer() as out:
   yaml.safe_dump(dataset_metadata_definition, out)

ref = branchIngestion.commit(message='Changed metadata file')
print_commit(ref.get_commit())

In [None]:
res = branchIngestion.merge_into(branchMain, 
        metadata={'notebook_url': 'https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/hooks-metadata-validation.ipynb',
                 'spark_version': '3.3.2'})
print(res)

## You can also review all Actions in the lakeFS UI

In [None]:
lakefsUIEndPoint = lakefs_ui_endpoint(lakefsEndPoint)
print(f"👉🏻 {lakefsUIEndPoint}/repositories/{repo_name}/actions")

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack