<img src="./images/logo.svg" alt="lakeFS logo" width=300/> 

# Searching objects' user metadata/labels in lakeFS

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [3]:
repo_name = "metadata-search-repo"

### Versioning Information

In [4]:
mainBranch = "main"
ingestionBranch = "ingestion_branch"

### Import libraries

In [5]:
%xmode Minimal
import os
import lakefs
from assets.lakefs_demo import print_commit, print_diff, lakefs_ui_endpoint
import yaml

Exception reporting mode: Minimal


### Set environment variables

In [6]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Verify lakeFS credentials by getting lakeFS version

In [7]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception:
    print("🛑 failed to get lakeFS version")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

Verifying lakeFS credentials…
…✅lakeFS credentials verified

ℹ️lakeFS version dev


### Define lakeFS Repository

In [8]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=mainBranch, exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

{'id': 'metadata-search-repo', 'creation_date': 1741803285, 'default_branch': 'main', 'storage_namespace': 's3://example/metadata-search-repo'}


---

# Main demo starts here 🚦 👇🏻

## Setup and Configure Hooks

### Configure hooks in the repository
* Upload the [hooks config YAML file](./hooks/post-commit-update-object-metadata.yaml), which triggers the update of each object's user metadata after data is committed
* The hooks config file must be uploaded under the `_lakefs_actions/` prefix

In [9]:
hooks_config_yaml = "post-commit-update-object-metadata.yaml"
hooks_prefix = "_lakefs_actions"

# Read the hooks config with a context manager so the file handle is closed
with open(f'./hooks/{hooks_config_yaml}', 'r') as f:
    contentToUpload = f.read()
print(branchMain.object(f'{hooks_prefix}/{hooks_config_yaml}').upload(data=contentToUpload, mode='wb', pre_sign=False))

_lakefs_actions/post-commit-update-object-metadata.yaml


### Upload Lua script

##### The script [update_object_metadata.lua](./hooks/update_object_metadata.lua) reads metadata/labels from the JSON files and updates user metadata for the objects in lakeFS

In [10]:
lua_script_file_name = "update_object_metadata.lua"
lua_scripts_path = "scripts"

# Read the Lua script with a context manager so the file handle is closed
with open(f'./hooks/{lua_script_file_name}', 'r') as f:
    contentToUpload = f.read()
print(branchMain.object(f'{lua_scripts_path}/{lua_script_file_name}').upload(data=contentToUpload, mode='wb', pre_sign=False))

scripts/update_object_metadata.lua
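The two upload cells above follow the same pattern: read a local file, then write it under a prefix on the branch. A minimal sketch of a shared helper follows. The name `upload_local_file` and the `_Recording*` stand-in classes are hypothetical and exist only so the example runs without a lakeFS server. The `branch.object(path).upload(...)` call mirrors the one used in the cells above.

```python
import tempfile
from pathlib import Path


def upload_local_file(branch, local_path, dest_prefix):
    """Upload one local file under dest_prefix on the branch; return the object path."""
    local_path = Path(local_path)
    dest_path = f"{dest_prefix.rstrip('/')}/{local_path.name}"
    data = local_path.read_bytes()
    # Same call shape as the notebook cells: branch.object(path).upload(...)
    branch.object(dest_path).upload(data=data, mode="wb", pre_sign=False)
    return dest_path


# Hypothetical stand-ins that record uploads in memory instead of calling lakeFS.
class _RecordingObject:
    def __init__(self, store, path):
        self._store = store
        self._path = path

    def upload(self, data, mode="w", pre_sign=None):
        self._store[self._path] = data


class _RecordingBranch:
    def __init__(self):
        self.uploaded = {}

    def object(self, path):
        return _RecordingObject(self.uploaded, path)


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "example.lua"
    script.write_text("print('hello')", encoding="utf-8")
    demo_branch = _RecordingBranch()
    dest = upload_local_file(demo_branch, script, "scripts/")
    print(dest)
```

With a real branch, the equivalent calls would be `upload_local_file(branchMain, './hooks/update_object_metadata.lua', 'scripts')` and `upload_local_file(branchMain, './hooks/post-commit-update-object-metadata.yaml', '_lakefs_actions')`.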


### Review a JSON file storing metadata
##### There is one JSON file for each image

In [11]:
!cat /data/stanfordogsdataset/Annotation_JSON/n02085620-Chihuahua/n02085620_199.json

{
	"annotation": {
		"folder": "02085620",
		"filename": "n02085620_199",
		"source": {
			"database": "ImageNet database"
		},
		"size": {
			"width": "300",
			"height": "430",
			"depth": "3"
		},
		"segment": "0",
		"object": {
			"name": "Chihuahua",
			"pose": "Unspecified",
			"truncated": "0",
			"difficult": "0",
			"bndbox": {
				"xmin": "65",
				"ymin": "50",
				"xmax": "249",
				"ymax": "404"
			}
		}
	}
}
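The search cells later in this notebook read keys such as `object.name` and `size.width`, which suggests the hook flattens the nested `annotation` document into dot-separated keys. The sketch below illustrates that idea in Python. It is not the Lua script itself: the function names `flatten` and `image_path_for`, the full set of resulting keys, and the annotation-to-image path mapping are assumptions made for illustration.

```python
import json
from pathlib import PurePosixPath


def flatten(node, prefix=""):
    """Flatten nested dicts into {'a.b.c': 'value'} with string values."""
    flat = {}
    for key, value in node.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = str(value)
    return flat


def image_path_for(annotation_path):
    """Assumed layout: .../Annotation_JSON/<breed>/<id>.json -> .../Images/<breed>/<id>.jpg"""
    parts = PurePosixPath(annotation_path).parts
    swapped = ["Images" if part == "Annotation_JSON" else part for part in parts]
    return str(PurePosixPath(*swapped).with_suffix(".jpg"))


# Trimmed copy of the annotation shown above.
doc = json.loads(
    '{"annotation": {"size": {"width": "300"},'
    ' "object": {"name": "Chihuahua", "bndbox": {"xmin": "65"}}}}'
)
user_metadata = flatten(doc["annotation"])
print(user_metadata)
print(image_path_for("data/Annotation_JSON/n02085620-Chihuahua/n02085620_199.json"))
```

Values stay strings here because object user metadata is stored as string key/value pairs, which is also why the width search later converts with `int(...)`.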

### Commit changes to the lakeFS repo

In [12]:
ref = branchMain.commit(message='Added hooks config file and metadata validation scripts')
print_commit(ref.get_commit())

Message: Added hooks config file and metadata validation scripts
ID: dbd37f283da50fd50f28efed4d90f0e81e592147c36d047b2592416c8a6e2853
Committer: everything-bagel
Creation Date: 2025-03-12 18:15:35
Parents: ['254e4a3aab2e10eccf0089463bffeee3b4ff91778fd453be6dc81d211cd330a0']
Metadata:
{}


## Create a new branch which will be used to ingest data

In [14]:
branchIngestion = repo.branch(ingestionBranch).create(source_reference=mainBranch, exist_ok=True)
print(f"{ingestionBranch} ref:", branchIngestion.get_commit().id)

ingestion_branch ref: cf11333988220650353429e4de0b7b896e1fc18b9c0ef8d41abc06165df4589f


## Import images as well as annotations/metadata

### Configure the source/target paths

In [14]:
# Import Sources and Destinations
importSource1 = "s3://sample-data/stanfordogsdataset/Images" # e.g. "s3://sample-dog-images/Images/n02085620-Chihuahua/"
importSource2 = "s3://sample-data/stanfordogsdataset/Annotation_JSON" # e.g. "s3://sample-dog-images/Annotation/n02085620-Chihuahua/"
importDestination = "data/" # imported objects are written under this prefix in the repository

### Run the import. The import commits the data and triggers the post-commit hook, which updates each object's user metadata.

In [15]:
import time

importer = branchIngestion.import_data(commit_message="import objects", metadata={"key": "value"}) \
    .prefix(importSource1, destination=importDestination) \
    .prefix(importSource2, destination=importDestination)

importer.start()
time.sleep(2)
status = importer.status()
print(status)

while not status.completed and status.error is None:
    time.sleep(2)
    status = importer.status()
    print(status)

if status.error:
    raise Exception(status.error)
    
print(f"\nImported a total of {status.ingested_objects} objects into branch {ingestionBranch}")

{'completed': True, 'update_time': datetime.datetime(2025, 3, 12, 18, 15, 55, 665202, tzinfo=datetime.timezone.utc), 'ingested_objects': 1177, 'metarange_id': 'fef0d96e95193b29c67721339914761857420518a51cd6cebf98e6ff7a49fee7', 'commit': Commit(id="cf11333988220650353429e4de0b7b896e1fc18b9c0ef8d41abc06165df4589f"), 'error': None}

Imported a total of 1177 objects into branch ingestion_branch
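The polling loop above runs until the import completes or reports an error, so a stalled import would keep it waiting indefinitely. A minimal sketch of the same loop with a timeout is shown below. `wait_for_import` and its parameters are hypothetical names. It assumes only that the status object exposes `completed` and `error`, as the printed status above does. The fake statuses let the example run without a lakeFS server.

```python
import time
from types import SimpleNamespace


def wait_for_import(get_status, poll_seconds=2.0, timeout_seconds=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the import completes, fails, or times out."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status.error:
            raise RuntimeError(f"import failed: {status.error}")
        if status.completed:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"import did not finish within {timeout_seconds} seconds")
        sleep(poll_seconds)


# Stand-in statuses: one "still running" response, then a completed one.
fake_states = iter([
    SimpleNamespace(completed=False, error=None, ingested_objects=0),
    SimpleNamespace(completed=True, error=None, ingested_objects=3),
])
final = wait_for_import(lambda: next(fake_states), poll_seconds=0)
print(final.ingested_objects)
```

In this notebook the call would be `status = wait_for_import(importer.status)` right after `importer.start()`.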


## Commit metadata updated by the hook

In [15]:
ref = branchIngestion.commit(message='Updated Metadata')
print_commit(ref.get_commit())

Message: Updated Metadata
ID: 86ae71dff42030b564199d795c0a8b33f5a849a7748904e61714c66f002d765c
Committer: everything-bagel
Creation Date: 2025-03-13 16:30:46
Parents: ['cf11333988220650353429e4de0b7b896e1fc18b9c0ef8d41abc06165df4589f']
Metadata:
{}


## Merge data into the main branch

In [16]:
branchIngestion.merge_into(branchMain)

'c1ca9831151c4b998a8e267402a29a82eb9936eb65c73e2035394986d4fd28cd'

# Metadata Search

#### Find all images labeled Chihuahua

In [11]:
branch = branchMain

for f in branch.objects(prefix='data'):
    if f.metadata and f.metadata.get('object.name') == 'Chihuahua':
        print(f.path)

data/Images/n02085620-Chihuahua/n02085620_199.jpg
data/Images/n02085620-Chihuahua/n02085620_242.jpg
data/Images/n02085620-Chihuahua/n02085620_275.jpg
data/Images/n02085620-Chihuahua/n02085620_326.jpg
data/Images/n02085620-Chihuahua/n02085620_368.jpg


#### Find all images wider than 400 pixels

In [12]:
branch = branchMain

for f in branch.objects(prefix='data'):
    if f.metadata and int(f.metadata.get('size.width', 0)) > 400:
        print(f.path)
        print("width: " + f.metadata['size.width'])

data/Images/n02085620-Chihuahua/n02085620_275.jpg
width: 500
data/Images/n02085620-Chihuahua/n02085620_368.jpg
width: 500
data/Images/n02085782-Japanese_spaniel/n02085782_1039.jpg
width: 500
data/Images/n02085782-Japanese_spaniel/n02085782_1059.jpg
width: 550
data/Images/n02085782-Japanese_spaniel/n02085782_1077.jpg
width: 519
data/Images/n02086079-Pekinese/n02086079_1020.jpg
width: 500
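Both searches above share one shape: list the objects under a prefix and keep those whose user metadata passes a test. A small sketch of a reusable filter follows. The helper names `find_by_metadata`, `labeled` and `width_over` are hypothetical, and the sample objects are stand-ins with made-up paths so the example runs without a lakeFS server.

```python
from types import SimpleNamespace


def find_by_metadata(objects, predicate):
    """Yield objects that have user metadata and satisfy predicate(metadata)."""
    for obj in objects:
        metadata = obj.metadata or {}
        if metadata and predicate(metadata):
            yield obj


def labeled(name):
    """Match objects whose 'object.name' metadata equals name."""
    return lambda metadata: metadata.get("object.name") == name


def width_over(min_width):
    """Match objects whose 'size.width' metadata parses to an int above min_width."""
    def check(metadata):
        try:
            return int(metadata.get("size.width", "")) > min_width
        except ValueError:
            return False  # missing or non-numeric width
    return check


# Stand-in objects with the same attributes the loops above use (path, metadata).
sample = [
    SimpleNamespace(path="data/Images/a.jpg", metadata={"object.name": "Chihuahua", "size.width": "500"}),
    SimpleNamespace(path="data/Images/b.jpg", metadata={"object.name": "Chihuahua", "size.width": "300"}),
    SimpleNamespace(path="data/Images/c.jpg", metadata={"object.name": "Pekinese", "size.width": "500"}),
    SimpleNamespace(path="data/Annotation_JSON/a.json", metadata=None),
]
wide_chihuahuas = [
    obj.path
    for obj in find_by_metadata(sample, lambda m: labeled("Chihuahua")(m) and width_over(400)(m))
]
print(wide_chihuahuas)
```

Against the repository, the first search would read `find_by_metadata(branchMain.objects(prefix='data'), labeled('Chihuahua'))`. The filter still scans every listed object on the client side, so it does not change the cost of the search, only how conditions are combined.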


## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack