# Feast Feature Store Explorer with Spark Backend

This notebook demonstrates how to query and explore the Feast feature store configured with a Spark offline backend and Iceberg tables on LakeFS.

## Data Flow
```
dlt (Kaggle) → Avro → MinIO → Spark → Iceberg (LakeFS) → Feast
```

## 1. Environment Setup

In [3]:
import os
import sys
from pathlib import Path

# Add project root to path for imports
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Set environment variables for local development (adjust as needed)
os.environ.setdefault("LAKEFS_ENDPOINT_URL", "http://localhost:8000")
os.environ.setdefault("LAKEFS_ACCESS_KEY_ID", "AKIAIOSFOLQUICKSTART")
os.environ.setdefault("LAKEFS_SECRET_ACCESS_KEY", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
os.environ.setdefault("LAKEFS_REPOSITORY", "kronodroid")
os.environ.setdefault("LAKEFS_BRANCH", "main")
os.environ.setdefault("REDIS_CONNECTION_STRING", "redis://localhost:16379")

print(f"Project root: {project_root}")
print(f"LakeFS endpoint: {os.environ['LAKEFS_ENDPOINT_URL']}")

Project root: /Users/benjaminbrown/Documents/GitHub/mlops
LakeFS endpoint: http://localhost:8000
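
Because the setup cell uses `os.environ.setdefault`, any value already exported in the shell takes precedence over these defaults, and a blank value would pass through unnoticed. A minimal sketch of a pre-flight check follows; the helper `missing_env_vars` is hypothetical (it is not part of this project) and simply reports variables that are unset or blank:

```python
import os

# Variable names taken from the setup cell above.
REQUIRED_VARS = [
    "LAKEFS_ENDPOINT_URL",
    "LAKEFS_ACCESS_KEY_ID",
    "LAKEFS_SECRET_ACCESS_KEY",
    "LAKEFS_REPOSITORY",
    "LAKEFS_BRANCH",
    "REDIS_CONNECTION_STRING",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


# Example with an explicit mapping so the result does not depend on the shell.
example_env = {"LAKEFS_ENDPOINT_URL": "http://localhost:8000", "LAKEFS_BRANCH": ""}
print(missing_env_vars(["LAKEFS_ENDPOINT_URL", "LAKEFS_BRANCH"], env=example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` right after the setup cell would check the real environment.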


## 2. Initialize Spark Session

Create a Spark session configured for Iceberg + LakeFS.

In [5]:
from engines.spark_engine.dfp_spark.session import get_spark_session, SparkConfig

# Create Spark session with Iceberg + LakeFS configuration
spark_config = SparkConfig(
    app_name="feast_explorer",
    driver_memory="2g",
    executor_memory="2g",
)

spark = get_spark_session(config=spark_config)
print(f"Spark version: {spark.version}")
print(f"Spark app name: {spark.sparkContext.appName}")

25/12/14 15:41:20 WARN Utils: Your hostname, Benjamins-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 192.168.0.197 instead (on interface en0)
25/12/14 15:41:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /Users/benjaminbrown/.ivy2/cache
The jars for the packages stored in: /Users/benjaminbrown/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.5_2.12 added as a dependency
org.apache.iceberg#iceberg-aws-bundle added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-6dbedf92-0178-493d-8405-eea7bacdb1bd;1.0
	confs: [default]


:: loading settings :: url = jar:file:/Users/benjaminbrown/Documents/GitHub/mlops/.venv/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.5.0 in central
	found org.apache.iceberg#iceberg-aws-bundle;1.5.0 in central
	found org.apache.spark#spark-avro_2.12;3.5.0 in central
	found org.tukaani#xz;1.9 in central
downloading https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.0/iceberg-spark-runtime-3.5_2.12-1.5.0.jar ...
	[SUCCESSFUL ] org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.5.0!iceberg-spark-runtime-3.5_2.12.jar (1224ms)
downloading https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.5.0/iceberg-aws-bundle-1.5.0.jar ...
	[SUCCESSFUL ] org.apache.iceberg#iceberg-aws-bundle;1.5.0!iceberg-aws-bundle.jar (873ms)
downloading https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.5.0/spark-avro_2.12-3.5.0.jar ...
	[SUCCESSFUL ] org.apache.spark#spark-avro_2.12;3.5.0!spark-avro_2.12.jar (97ms)
downloading https://repo1.maven.org/maven2/org/tukaani/xz/1.9/xz-1.9.jar ...
	[SUCCESSFUL ] org.tukaani#xz;1

Spark version: 3.5.7
Spark app name: feast_explorer
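
The body of `get_spark_session` is not shown in this notebook. As a rough sketch only, the settings needed for Iceberg on LakeFS can be collected into a plain dictionary. The keys mirror the Spark configuration printed in section 10; the function name `build_iceberg_lakefs_conf` and its parameters are hypothetical, and credentials are deliberately left out:

```python
import os


def build_iceberg_lakefs_conf(repository, branch, endpoint, catalog="lakefs_catalog"):
    """Build Spark settings for an Iceberg Hadoop catalog stored on LakeFS via S3A."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        # The warehouse path follows the repository/branch layout seen in section 10.
        f"{prefix}.warehouse": f"s3a://{repository}/{branch}/iceberg",
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        # Enable SSL only when the endpoint URL uses https.
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(endpoint.startswith("https")).lower(),
    }


conf = build_iceberg_lakefs_conf(
    repository=os.environ.get("LAKEFS_REPOSITORY", "kronodroid"),
    branch=os.environ.get("LAKEFS_BRANCH", "main"),
    endpoint=os.environ.get("LAKEFS_ENDPOINT_URL", "http://localhost:8000"),
)
for key, value in sorted(conf.items()):
    print(f"{key} = {value}")
```

Each pair could then be applied with `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.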


## 3. Initialize Feast Feature Store

Connect to the Feast feature store with Spark offline store configuration.

In [6]:
from feast import FeatureStore

# Path to feast feature_store.yaml
feast_repo_path = project_root / "feature_stores" / "feast_store"

# Initialize the feature store
store = FeatureStore(repo_path=str(feast_repo_path))

print(f"Feast project: {store.project}")
print(f"Registry path: {store.config.registry}")
print(f"Offline store type: {store.config.offline_store.type}")



Feast project: dfp
Registry path: registry_type='file' registry_store_type=None path='data/registry.db' cache_ttl_seconds=60 s3_additional_kwargs=None purge_feast_metadata=False
Offline store type: spark


## 4. List All Feature Views

Display all registered feature views in the Feast registry.

In [7]:
import pandas as pd

# Get all feature views
feature_views = store.list_feature_views()
batch_feature_views = store.list_batch_feature_views()
on_demand_feature_views = store.list_on_demand_feature_views()

print("\n📊 Feature Views Summary")
print(f"{'='*50}")
print(f"Regular Feature Views: {len(feature_views)}")
print(f"Batch Feature Views: {len(batch_feature_views)}")
print(f"On-Demand Feature Views: {len(on_demand_feature_views)}")
print(f"{'='*50}")


📊 Feature Views Summary
Regular Feature Views: 2
Batch Feature Views: 2
On-Demand Feature Views: 1


## 5. Feature Views Details

Display detailed information about each feature view.

In [8]:
def display_feature_view_details(fv):
    """Display detailed information about a feature view."""
    print(f"\n🔹 Feature View: {fv.name}")
    print(f"   {'─'*45}")
    
    # Entities
    entity_names = [e.name if hasattr(e, 'name') else str(e) for e in fv.entities]
    print(f"   Entities: {', '.join(entity_names)}")
    
    # TTL
    print(f"   TTL: {fv.ttl}")
    
    # Online serving
    online = getattr(fv, 'online', 'N/A')
    print(f"   Online: {online}")
    
    # Tags
    tags = getattr(fv, 'tags', {})
    if tags:
        print(f"   Tags: {tags}")
    
    # Source
    source = getattr(fv, 'batch_source', getattr(fv, 'source', None))
    if source:
        source_name = getattr(source, 'name', type(source).__name__)
        print(f"   Source: {source_name}")
        if hasattr(source, 'table'):
            print(f"   Table: {source.table}")
    
    # Schema/Features
    schema = getattr(fv, 'schema', [])
    if schema:
        print(f"   Features ({len(schema)}):")
        for field in schema:
            desc = getattr(field, 'description', '')
            desc_str = f" - {desc}" if desc else ""
            print(f"      • {field.name}: {field.dtype}{desc_str}")

# Display regular feature views
print("\n" + "="*60)
print("REGULAR FEATURE VIEWS")
print("="*60)
for fv in feature_views:
    display_feature_view_details(fv)


REGULAR FEATURE VIEWS

🔹 Feature View: malware_batch_features
   ─────────────────────────────────────────────
   Entities: malware_sample
   TTL: 365 days, 0:00:00
   Online: False
   Tags: {'usage': 'training', 'dataset': 'kronodroid', 'team': 'dfp'}
   Source: kronodroid_training_source
   Features (4):
      • is_malware: Int64 - Target label
      • dataset_split: String
      • data_source: String
      • sample_id: String

🔹 Feature View: malware_sample_features
   ─────────────────────────────────────────────
   Entities: malware_sample
   TTL: 365 days, 0:00:00
   Online: True
   Tags: {'dataset': 'kronodroid', 'team': 'dfp'}
   Source: kronodroid_training_source
   Features (5):
      • sample_id: String
      • data_source: String - emulator or real_device
      • dataset_split

In [9]:
# Display batch feature views
print("\n" + "="*60)
print("BATCH FEATURE VIEWS")
print("="*60)
for fv in batch_feature_views:
    display_feature_view_details(fv)


BATCH FEATURE VIEWS

🔹 Feature View: malware_batch_features
   ─────────────────────────────────────────────
   Entities: malware_sample
   TTL: 365 days, 0:00:00
   Online: False
   Tags: {'usage': 'training', 'dataset': 'kronodroid', 'team': 'dfp'}
   Source: kronodroid_training_source
   Features (4):
      • is_malware: Int64 - Target label
      • dataset_split: String
      • data_source: String
      • sample_id: String

🔹 Feature View: malware_sample_features
   ─────────────────────────────────────────────
   Entities: malware_sample
   TTL: 365 days, 0:00:00
   Online: True
   Tags: {'dataset': 'kronodroid', 'team': 'dfp'}
   Source: kronodroid_training_source
   Features (5):
      • sample_id: String
      • data_source: String - emulator or real_device
      • dataset_split: 

In [10]:
# Display on-demand feature views
print("\n" + "="*60)
print("ON-DEMAND FEATURE VIEWS")
print("="*60)
for odfv in on_demand_feature_views:
    print(f"\n🔸 On-Demand Feature View: {odfv.name}")
    print(f"   {'─'*45}")
    
    # Source feature views
    sources = list(odfv.source_feature_view_projections.keys())
    print(f"   Source FVs: {', '.join(sources)}")
    
    # Schema
    schema = getattr(odfv, 'schema', [])
    if schema:
        print(f"   Computed Features ({len(schema)}):")
        for field in schema:
            print(f"      • {field.name}: {field.dtype}")


ON-DEMAND FEATURE VIEWS

🔸 On-Demand Feature View: malware_derived_features
   ─────────────────────────────────────────────
   Source FVs: malware_sample_features
   Computed Features (2):
      • is_emulator_sample: Int64
      • __dummy_id: String


## 6. Entities

List all entities defined in the feature store.

In [11]:
# List all entities
entities = store.list_entities()

print("\n" + "="*60)
print("ENTITIES")
print("="*60)

entity_data = []
for entity in entities:
    entity_data.append({
        "Name": entity.name,
        # In the Feast version used here, Entity has no `join_keys` attribute;
        # read the singular `join_key` and fall back to the entity name.
        "Join Key": getattr(entity, "join_key", entity.name),
        "Value Type": str(entity.value_type),
        "Description": entity.description or "N/A"
    })

entities_df = pd.DataFrame(entity_data)
display(entities_df)


ENTITIES

## 7. Data Sources

List all data sources configured in the feature store.

In [12]:
# List all data sources
data_sources = store.list_data_sources()

print("\n" + "="*60)
print("DATA SOURCES")
print("="*60)

source_data = []
for source in data_sources:
    source_info = {
        "Name": source.name,
        "Type": type(source).__name__,
    }
    
    # Add table info for SparkSource
    if hasattr(source, 'table'):
        source_info["Table"] = source.table
    
    # Add timestamp field
    if hasattr(source, 'timestamp_field'):
        source_info["Timestamp Field"] = source.timestamp_field
    
    source_data.append(source_info)

sources_df = pd.DataFrame(source_data)
display(sources_df)


DATA SOURCES


| | Name | Type | Timestamp Field |
|---|---|---|---|
| 0 | kronodroid_categories_source | FileSource | _dbt_loaded_at |
| 1 | kronodroid_training_source | FileSource | event_timestamp |
| 2 | kronodroid_samples_source | FileSource | event_timestamp |
| 3 | kronodroid_push_source | PushSource | |


## 8. Feature Views Summary Table

Create a summary table of all feature views with their key attributes.

In [13]:
def get_fv_summary(fv, fv_type="FeatureView"):
    """Extract summary info from a feature view."""
    schema = getattr(fv, 'schema', [])
    entities = [e.name if hasattr(e, 'name') else str(e) for e in fv.entities] if hasattr(fv, 'entities') else []
    tags = getattr(fv, 'tags', {})
    source = getattr(fv, 'batch_source', getattr(fv, 'source', None))
    source_name = getattr(source, 'name', 'N/A') if source else 'N/A'
    
    return {
        "Name": fv.name,
        "Type": fv_type,
        "Entities": ", ".join(entities),
        "# Features": len(schema),
        "TTL": str(getattr(fv, 'ttl', 'N/A')),
        "Online": getattr(fv, 'online', 'N/A'),
        "Source": source_name,
        "Tags": ", ".join(f"{k}={v}" for k, v in tags.items()) if tags else "N/A"
    }

# Collect all feature views
all_fv_data = []

for fv in feature_views:
    all_fv_data.append(get_fv_summary(fv, "FeatureView"))

for fv in batch_feature_views:
    all_fv_data.append(get_fv_summary(fv, "BatchFeatureView"))

for odfv in on_demand_feature_views:
    sources = list(odfv.source_feature_view_projections.keys())
    schema = getattr(odfv, 'schema', [])
    all_fv_data.append({
        "Name": odfv.name,
        "Type": "OnDemandFeatureView",
        "Entities": "N/A",
        "# Features": len(schema),
        "TTL": "N/A",
        "Online": True,
        "Source": ", ".join(sources),
        "Tags": "N/A"
    })

fv_summary_df = pd.DataFrame(all_fv_data)
print("\n📋 Feature Views Summary Table")
print("="*80)
display(fv_summary_df)


📋 Feature Views Summary Table


| | Name | Type | Entities | # Features | TTL | Online | Source | Tags |
|---|---|---|---|---|---|---|---|---|
| 0 | malware_batch_features | FeatureView | malware_sample | 4 | 365 days, 0:00:00 | False | kronodroid_training_source | usage=training, dataset=kronodroid, team=dfp |
| 1 | malware_sample_features | FeatureView | malware_sample | 5 | 365 days, 0:00:00 | True | kronodroid_training_source | dataset=kronodroid, team=dfp |
| 2 | malware_batch_features | BatchFeatureView | malware_sample | 4 | 365 days, 0:00:00 | False | kronodroid_training_source | usage=training, dataset=kronodroid, team=dfp |
| 3 | malware_sample_features | BatchFeatureView | malware_sample | 5 | 365 days, 0:00:00 | True | kronodroid_training_source | dataset=kronodroid, team=dfp |
| 4 | malware_derived_features | OnDemandFeatureView | | 2 | | True | malware_sample_features | |
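
In this run each regular view appears twice, once per listing call, because `list_feature_views()` and `list_batch_feature_views()` returned the same two names. If one row per view is preferred, the rows can be deduplicated by name before the DataFrame is built. A minimal sketch on plain dictionaries, assuming a hypothetical helper `dedupe_by_name` and that the first occurrence should win:

```python
def dedupe_by_name(rows):
    """Keep the first row seen for each "Name", preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row["Name"] in seen:
            continue  # a row with this name was already kept
        seen.add(row["Name"])
        unique.append(row)
    return unique


# Names and types copied from the summary table above.
rows = [
    {"Name": "malware_batch_features", "Type": "FeatureView"},
    {"Name": "malware_sample_features", "Type": "FeatureView"},
    {"Name": "malware_batch_features", "Type": "BatchFeatureView"},
    {"Name": "malware_sample_features", "Type": "BatchFeatureView"},
    {"Name": "malware_derived_features", "Type": "OnDemandFeatureView"},
]
for row in dedupe_by_name(rows):
    print(row["Name"], row["Type"])
```

In the notebook this would be applied as `pd.DataFrame(dedupe_by_name(all_fv_data))`.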


## 9. Query Feature View with Spark (Example)

Demonstrate how to fetch historical features using the Spark offline store.

In [14]:
from datetime import datetime, timedelta

# Create a sample entity DataFrame for historical feature retrieval
# This would typically come from your application data
entity_df = pd.DataFrame({
    "sample_id": ["sample_001", "sample_002", "sample_003"],
    "event_timestamp": [
        datetime.now() - timedelta(days=1),
        datetime.now() - timedelta(days=2),
        datetime.now() - timedelta(days=3),
    ]
})

print("Sample entity DataFrame:")
display(entity_df)

Sample entity DataFrame:


| | sample_id | event_timestamp |
|---|---|---|
| 0 | sample_001 | 2025-12-13 15:44:17.676131 |
| 1 | sample_002 | 2025-12-12 15:44:17.676154 |
| 2 | sample_003 | 2025-12-11 15:44:17.676156 |


In [15]:
# Fetch historical features (requires LakeFS, Spark, and Iceberg to be running).
# This uses the Spark offline store configured in feature_store.yaml.

feature_refs = [
    "malware_sample_features:app_package",
    "malware_sample_features:is_malware",
    "malware_sample_features:data_source",
    "malware_sample_features:dataset_split",
]

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=feature_refs,
).to_df()

print("Historical features retrieved via Spark:")
display(training_df)

AssertionError: 
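
Conceptually, historical retrieval is a point-in-time join: for each row of `entity_df`, the store returns the most recent feature values whose timestamp is at or before that row's `event_timestamp` and still within the view's TTL. The following pure-Python illustration shows the idea only; it is not Feast's implementation, and `point_in_time_lookup` and the sample data are hypothetical:

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, entity_ts, ttl):
    """Return the value of the latest row with timestamp <= entity_ts within ttl, else None.

    `feature_rows` is a list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, entity_ts) - 1  # last row at or before entity_ts
    if idx < 0:
        return None  # no feature value existed yet
    ts, value = feature_rows[idx]
    if entity_ts - ts > ttl:
        return None  # the newest candidate is older than the TTL allows
    return value


history = [
    (datetime(2025, 1, 1), "train"),
    (datetime(2025, 6, 1), "test"),
]
ttl = timedelta(days=365)

print(point_in_time_lookup(history, datetime(2025, 3, 1), ttl))   # value as of March 2025
print(point_in_time_lookup(history, datetime(2024, 12, 1), ttl))  # before any data exists
print(point_in_time_lookup(history, datetime(2027, 1, 1), ttl))   # newest row is past the TTL
```

The same rule explains why future-dated feature rows never leak into a training set: only rows at or before the entity timestamp are eligible.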

## 10. Registry Inspection

Inspect the Feast registry directly for additional metadata.

In [16]:
# Get registry information
print("\n" + "="*60)
print("REGISTRY INFORMATION")
print("="*60)

print(f"\nProject: {store.project}")
print(f"Provider: {store.config.provider}")
print(f"\nOffline Store Configuration:")
print(f"  Type: {store.config.offline_store.type}")

# Show Spark configuration from the offline store
if hasattr(store.config.offline_store, 'spark_conf'):
    print(f"\nSpark Configuration:")
    for key, value in store.config.offline_store.spark_conf.items():
        # Mask sensitive values
        if 'secret' in key.lower() or 'password' in key.lower() or 'key' in key.lower():
            print(f"    {key}: ***")
        else:
            print(f"    {key}: {value}")

print(f"\nOnline Store Configuration:")
print(f"  Type: {store.config.online_store.type}")


REGISTRY INFORMATION

Project: dfp
Provider: local

Offline Store Configuration:
  Type: spark

Spark Configuration:
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.lakefs_catalog: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.lakefs_catalog.type: hadoop
    spark.sql.catalog.lakefs_catalog.warehouse: ${LAKEFS_WAREHOUSE:-s3a://kronodroid/main/iceberg}
    spark.hadoop.fs.s3a.endpoint: ${LAKEFS_ENDPOINT_URL:-http://localhost:8000}
    spark.hadoop.fs.s3a.access.key: ***
    spark.hadoop.fs.s3a.secret.key: ***
    spark.hadoop.fs.s3a.path.style.access: true
    spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.connection.ssl.enabled: false
    spark.sql.iceberg.write.format.default: avro
    spark.driver.memory: 2g
    spark.executor.memory: 2g

Online Store Configuration:
  Type: redis
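
The warehouse and endpoint values above are printed with their `${VAR:-default}` placeholders rather than resolved values. To preview what they would resolve to, a small resolver can apply the shell-style rule (use the variable if it is set and non-empty, otherwise the default). This is a sketch under that assumption; `expand_placeholders` is a hypothetical helper, not a Feast API:

```python
import re

# Matches ${NAME} and ${NAME:-default}.
_PLACEHOLDER = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}")


def expand_placeholders(text, env):
    """Expand ${VAR} and ${VAR:-default} in `text` using the mapping `env`."""

    def replace(match):
        value = env.get(match.group("name"), "")
        if value:
            return value
        # Unset or empty: fall back to the default, or to an empty string if none is given.
        return match.group("default") or ""

    return _PLACEHOLDER.sub(replace, text)


raw = "${LAKEFS_WAREHOUSE:-s3a://kronodroid/main/iceberg}"
print(expand_placeholders(raw, {}))
print(expand_placeholders(raw, {"LAKEFS_WAREHOUSE": "s3a://kronodroid/dev/iceberg"}))
```

Passing `os.environ` as `env` would resolve the values against the current session.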


## 11. Cleanup

In [None]:
# Stop the Spark session when done
spark.stop()

print("\n✅ Notebook complete!")
print("   Spark session stopped.")