<img src="./images/logo.svg" alt="lakeFS logo" width=300/>

## lakeFS Metadata Search
Use Cases:
* Data Discovery & Exploration: Quickly find relevant data using flexible filters (e.g., annotations, object size, timestamps).
* Data Governance: Audit metadata tags, detect sensitive data (like PII), and ensure objects are properly labeled with ownership or classification to support internal policies and external compliance requirements.
* Operational Troubleshooting: Filter and inspect data using metadata like workflow ID or publish time to trace lineage, debug pipeline issues, and understand how data was created or modified, all within a specific lakeFS version.

[📚 lakeFS Metadata Search Docs](https://docs.lakefs.io/latest/datamanagment/metadata-search/)

## Prerequisites

###### This notebook requires a connection to lakeFS Cloud or lakeFS Enterprise.
###### Register for lakeFS Cloud: https://lakefs.cloud/register or contact us for a lakeFS Enterprise key: https://lakefs.io/contact-sales/
###### 
###### Metadata Search must also be configured: see the [configuration docs](https://docs.lakefs.io/latest/datamanagment/metadata-search/#configuration)

---

## Config

### Change lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'https://org_name.us-east-1.lakefscloud.io'
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
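
If you prefer not to keep keys in the notebook itself, the same three values can be read from environment variables. The following is a minimal sketch, not part of the original sample: the helper `load_lakefs_credentials` and the variable names `LAKEFS_ENDPOINT`, `LAKEFS_ACCESS_KEY_ID` and `LAKEFS_SECRET_ACCESS_KEY` are hypothetical and can be renamed freely.

```python
import os

def load_lakefs_credentials(env=os.environ):
    """Read the lakeFS endpoint and key pair from a mapping of environment variables."""
    # Hypothetical variable names chosen for this sketch.
    names = ("LAKEFS_ENDPOINT", "LAKEFS_ACCESS_KEY_ID", "LAKEFS_SECRET_ACCESS_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    endpoint, access_key, secret_key = (env[name] for name in names)
    # Drop a trailing slash so that f'{endpoint}/iceberg/api' does not contain '//'.
    return endpoint.rstrip("/"), access_key, secret_key
```

With the variables exported before starting Jupyter, `lakefsEndPoint, lakefsAccessKey, lakefsSecretKey = load_lakefs_credentials()` could replace the assignments above.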

### Provide the name of the lakeFS metadata repository
##### The naming convention for the metadata repository is `<repo>-metadata`, where `<repo>` is the data repository ID, e.g. `quickstart-metadata` for the `quickstart` repository

In [None]:
repo_name = "lakefs-samples-repo-metadata"
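
The naming convention described above can also be applied in code, so that only the data repository ID has to be typed. This is a small illustrative sketch; the helper name `metadata_repo_name` is hypothetical and not part of lakeFS.

```python
def metadata_repo_name(data_repo_id):
    """Apply the `<repo>-metadata` naming convention to a data repository ID."""
    if not data_repo_id:
        raise ValueError("data_repo_id must be a non-empty string")
    return f"{data_repo_id}-metadata"

# The data repository `quickstart` maps to the metadata repository `quickstart-metadata`.
example_repo_name = metadata_repo_name("quickstart")
```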

### Provide lakeFS branch name

In [None]:
branch_name = "main"

### Change Storage information

In [None]:
storage_endpoint = "s3.us-east-1.amazonaws.com"
storage_access_key = "aaaaaaaaaaaaaaaa"
storage_secret_key = "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
storage_region = "us-east-1"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

### Install and import libraries

In [None]:
%pip install duckdb
%pip install pyiceberg==0.9.1

In [None]:
from pyiceberg.catalog import load_catalog
from pyiceberg.catalog.rest import RestCatalog

### Define Iceberg catalog

In [None]:
catalog = RestCatalog(
    name="my_catalog",
    **{
        # lakeFS Iceberg REST catalog endpoint and credentials
        'prefix': 'lakefs',
        'uri': f'{lakefsEndPoint}/iceberg/api',
        'oauth2-server-uri': f'{lakefsEndPoint}/iceberg/api/v1/oauth/tokens',
        'credential': f'{lakefsAccessKey}:{lakefsSecretKey}',
        # Object storage that holds the table data
        's3.endpoint': storage_endpoint,
        's3.access-key-id': storage_access_key,
        's3.secret-access-key': storage_secret_key,
        's3.region': storage_region,
        's3.force-virtual-addressing': False,
    },
)

### Load metadata table

In [None]:
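# `repo_name` is the metadata repository and `branch_name` the branch to search.
# `to_duckdb` loads the scan into a DuckDB table named `object_metadata` and returns the connection.
con = catalog.load_table(f'{repo_name}.{branch_name}.system.object_metadata').scan().to_duckdb('object_metadata')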

---

# Main demo starts here 🚦 👇🏻

### Search objects based on a string metadata key

In [None]:
metadata_key = 'source.database'
metadata_value = 'Airbus Ship Detection Challenge'

In [None]:
query = f"""
SELECT path, size_bytes
FROM object_metadata
WHERE user_metadata['{metadata_key}'] = '{metadata_value}'
"""

df = con.execute(query).df()
df
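
The query above interpolates both the key and the value straight into the SQL text, so a value containing a single quote would break it. A possible alternative, sketched here under assumptions, is to bind the value as a DuckDB `?` parameter and to validate the key before inlining it. The helper `build_string_filter` and the set of characters allowed in a key are hypothetical choices for this sketch, not lakeFS rules.

```python
import re

# Assumption: metadata keys contain only letters, digits, '.', '_', ':' and '-'.
_SAFE_KEY = re.compile(r"[A-Za-z0-9._:-]+")

def build_string_filter(metadata_key, metadata_value):
    """Return (query, params) selecting objects whose metadata key equals the value."""
    if not _SAFE_KEY.fullmatch(metadata_key):
        raise ValueError(f"Unsupported metadata key: {metadata_key!r}")
    query = (
        "SELECT path, size_bytes\n"
        "FROM object_metadata\n"
        f"WHERE user_metadata['{metadata_key}'] = ?"
    )
    # The value is passed separately and bound by DuckDB, never pasted into the SQL.
    return query, [metadata_value]
```

The result can then be run with `query, params = build_string_filter(metadata_key, metadata_value)` followed by `df = con.execute(query, params).df()`.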

### Search objects based on an integer metadata key

In [None]:
metadata_key = 'size.width'
metadata_value = 300

In [None]:
query = f"""
SELECT path, size_bytes 
FROM object_metadata
WHERE CAST(user_metadata['{metadata_key}'] AS INTEGER) >= {metadata_value}
"""

df = con.execute(query).df()
df

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack