# Feast Feature Store: Historical Features

This notebook demonstrates how to:
- Connect to the Feast feature store
- Explore available entities and feature views
- Retrieve historical features for model training
- Work with the Kronodroid malware detection dataset

## Setup & Configuration

In [1]:
import os
from datetime import datetime, timedelta
from pathlib import Path

import pandas as pd
from IPython.display import display, Markdown

# Feast imports
from feast import FeatureStore

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [2]:
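# Configuration: environment variables for LakeFS and Redis,
# used by the Spark offline store and the Redis online store.
# setdefault() only fills in variables that are not already set,
# so values exported in the shell take precedence over these defaults.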

# LakeFS Configuration (for Iceberg offline store)
os.environ.setdefault("LAKEFS_ENDPOINT_URL", "http://localhost:8000")
os.environ.setdefault("LAKEFS_ACCESS_KEY_ID", "AKIAJWAE4BUBMLQESYDQ")
os.environ.setdefault("LAKEFS_SECRET_ACCESS_KEY", "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")

# Redis Configuration (for online store)
os.environ.setdefault("REDIS_CONNECTION_STRING", "redis://localhost:6379")

print(f"LakeFS Endpoint: {os.environ.get('LAKEFS_ENDPOINT_URL')}")
print(f"Redis: {os.environ.get('REDIS_CONNECTION_STRING')}")

LakeFS Endpoint: http://localhost:8000
Redis: redis://localhost:16379
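
Because `setdefault` leaves already-exported variables untouched, the printed values can differ from the defaults in the cell, as the Redis port in the output above shows. A minimal pre-flight sketch follows, assuming the four variable names used in the configuration cell are the ones that matter. The helpers `check_required_env` and `mask_secret` are illustrative names and not part of Feast.

```python
import os

# Variable names taken from the configuration cell above.
REQUIRED_VARS = [
    "LAKEFS_ENDPOINT_URL",
    "LAKEFS_ACCESS_KEY_ID",
    "LAKEFS_SECRET_ACCESS_KEY",
    "REDIS_CONNECTION_STRING",
]


def check_required_env(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


def mask_secret(value, visible=4):
    """Keep only the first few characters of a secret so it is safer to print."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


missing = check_required_env(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Running the check before `FeatureStore(...)` is constructed tends to give a clearer message than a connection error raised later by the offline or online store.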


In [3]:
# Initialize Feast Feature Store
# The feature_store.yaml is located in the feast_store directory

FEAST_REPO_PATH = Path("../feature_stores/feast_store")

try:
    store = FeatureStore(repo_path=str(FEAST_REPO_PATH))
    print(f"Connected to Feast project: {store.project}")
    print(f"Registry path: {store.config.registry}")
except Exception as e:
    print(f"Error connecting to Feast: {e}")
    raise



Connected to Feast project: dfp
Registry path: registry_type='file' registry_store_type=None path='data/registry.db' cache_ttl_seconds=60 cache_mode='sync' s3_additional_kwargs=None purge_feast_metadata=False


---
## Explore Feature Store Registry

List all available entities, feature views, and data sources.

### List Entities

In [4]:
def list_entities(store: FeatureStore):
    """List all entities in the feature store."""
    entities = store.list_entities()
    
    if not entities:
        print("No entities found in the registry.")
        return []
    
    data = []
    for entity in entities:
        data.append({
            "Name": entity.name,
            "Join Key": entity.join_key,
            "Value Type": str(entity.value_type),
            "Description": entity.description or "-",
        })
    
    df = pd.DataFrame(data)
    display(df)
    return entities

entities = list_entities(store)


### List Feature Views

In [5]:
def list_feature_views(store: FeatureStore):
    """List all feature views in the feature store."""
    feature_views = store.list_feature_views()
    
    if not feature_views:
        print("No feature views found in the registry.")
        return []
    
    data = []
    for fv in feature_views:
        entities_str = ", ".join([e.name for e in fv.entity_columns]) if fv.entity_columns else "-"
        feature_count = len(fv.schema) if fv.schema else 0
        tags_str = ", ".join([f"{k}={v}" for k, v in fv.tags.items()]) if fv.tags else "-"
        
        data.append({
            "Name": fv.name,
            "Entities": entities_str,
            "Features": feature_count,
            "TTL": str(fv.ttl) if fv.ttl else "None",
            "Online": fv.online,
            "Tags": tags_str,
        })
    
    df = pd.DataFrame(data)
    display(df)
    return feature_views

feature_views = list_feature_views(store)



| | Name | Entities | Features | TTL | Online | Tags |
|---|---|---|---|---|---|---|
| 0 | user_daily_features | - | 1 | | True | - |
| 1 | malware_family_features | family_id | 10 | 30 days, 0:00:00 | True | dataset=kronodroid, team=dfp |
| 2 | malware_batch_features | sample_id | 4 | 365 days, 0:00:00 | False | usage=training, dataset=kronodroid, team=dfp |
| 3 | malware_sample_features | sample_id | 28 | 365 days, 0:00:00 | True | dataset=kronodroid, team=dfp |
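
The tags in the listing above can be used to narrow down which feature views belong to a given dataset or purpose. The sketch below filters on tags using only the `name` and `tags` attributes. The helper `filter_by_tags` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that mirror the listing so the example runs without a registry.

```python
from types import SimpleNamespace


def filter_by_tags(feature_views, **required_tags):
    """Keep feature views whose tags contain every required key/value pair."""
    return [
        fv
        for fv in feature_views
        if all((fv.tags or {}).get(k) == v for k, v in required_tags.items())
    ]


# Stand-in objects mirroring the names and tags shown in the listing above.
demo_views = [
    SimpleNamespace(name="user_daily_features", tags={}),
    SimpleNamespace(
        name="malware_family_features",
        tags={"dataset": "kronodroid", "team": "dfp"},
    ),
    SimpleNamespace(
        name="malware_batch_features",
        tags={"usage": "training", "dataset": "kronodroid", "team": "dfp"},
    ),
    SimpleNamespace(
        name="malware_sample_features",
        tags={"dataset": "kronodroid", "team": "dfp"},
    ),
]

training_views = filter_by_tags(demo_views, usage="training")
print([fv.name for fv in training_views])
```

With a live store, the same function should work on the list returned by `store.list_feature_views()`, since it only reads `fv.name` and `fv.tags`.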


### List On-Demand Feature Views

In [None]:
def list_on_demand_feature_views(store: FeatureStore):
    """List all on-demand feature views."""
    odfvs = store.list_on_demand_feature_views()
    
    if not odfvs:
        print("No on-demand feature views found in the registry.")
        return []
    
    data = []
    for odfv in odfvs:
        sources_str = ", ".join([s.name for s in odfv.source_feature_view_projections.values()])
        feature_names = [f.name for f in odfv.schema] if odfv.schema else []
        
        data.append({
            "Name": odfv.name,
            "Sources": sources_str,
            "Features": ", ".join(feature_names),
        })
    
    df = pd.DataFrame(data)
    display(df)
    return odfvs

on_demand_fvs = list_on_demand_feature_views(store)

### Inspect Feature View Schema

In [None]:
def inspect_feature_view(store: FeatureStore, feature_view_name: str):
    """Display detailed schema for a feature view."""
    try:
        fv = store.get_feature_view(feature_view_name)
    except Exception as e:
        print(f"Feature view '{feature_view_name}' not found: {e}")
        return
    
    print(f"Feature View: {fv.name}")
    print(f"Description: {fv.description or 'N/A'}")
    print(f"TTL: {fv.ttl}")
    print(f"Online: {fv.online}")
    print(f"Source: {fv.batch_source.name if fv.batch_source else 'N/A'}")
    print("\nSchema:")
    
    if fv.schema:
        data = []
        for field in fv.schema:
            data.append({
                "Feature": field.name,
                "Type": str(field.dtype),
                "Description": field.description or "-",
            })
        display(pd.DataFrame(data))
    else:
        print("  No schema defined")
    
    return fv

# Inspect the main malware sample features
fv = inspect_feature_view(store, "malware_sample_features")

---
## Get Historical Features

Retrieve historical features for model training using point-in-time correct joins.
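
To make "point-in-time correct" concrete: for each entity row, the join picks the most recent feature value recorded at or before that row's timestamp, and ignores values older than the feature view's TTL. The following is a plain-Python sketch of that rule on toy data. It illustrates the idea only and is not how Feast or the Spark offline store implements the join; `point_in_time_lookup` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, entity_id, as_of, ttl):
    """Return the newest value for entity_id observed at or before as_of.

    feature_rows is a list of (entity_id, timestamp, value) tuples.
    Returns None when no row falls inside the window [as_of - ttl, as_of].
    """
    history = sorted(
        ((ts, value) for eid, ts, value in feature_rows if eid == entity_id),
        key=lambda row: row[0],
    )
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)  # number of rows with ts <= as_of
    if idx == 0:
        return None  # every observation is in the "future" of as_of
    ts, value = history[idx - 1]
    if as_of - ts > ttl:
        return None  # newest eligible observation is older than the TTL
    return value


rows = [
    ("sample_001", datetime(2024, 1, 1), 0.10),
    ("sample_001", datetime(2024, 1, 10), 0.25),
    ("sample_001", datetime(2024, 2, 1), 0.90),  # recorded after the label time
]
label_time = datetime(2024, 1, 15)
value = point_in_time_lookup(rows, "sample_001", label_time, ttl=timedelta(days=365))
print(value)  # 0.25
```

The value from 1 February is never used for a label dated 15 January, which is what keeps future information out of the training set.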

### Create Entity DataFrame

The entity DataFrame specifies which entities and timestamps to retrieve features for.

In [None]:
# Create a sample entity DataFrame for historical feature retrieval
# This represents the entities (samples) and timestamps for which we want features

# For demonstration, we'll create sample IDs
# In practice, these would come from your training/inference data

sample_entity_df = pd.DataFrame({
    "sample_id": [
        "sample_001",
        "sample_002",
        "sample_003",
        "sample_004",
        "sample_005",
    ],
    "event_timestamp": [
        datetime.now() - timedelta(days=1),
        datetime.now() - timedelta(days=2),
        datetime.now() - timedelta(days=3),
        datetime.now() - timedelta(days=4),
        datetime.now() - timedelta(days=5),
    ],
})

print("Entity DataFrame for historical feature retrieval:")
display(sample_entity_df)
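
The same entity rows can be generated programmatically, which helps once the ID list grows beyond a handful. The sketch below builds the rows as plain dictionaries with timezone-aware UTC timestamps and checks that the required columns are present. The helper names are hypothetical, and whether your offline store prefers naive or timezone-aware timestamps is worth verifying for your setup.

```python
from datetime import datetime, timedelta, timezone


def make_entity_rows(sample_ids, as_of=None, spacing=timedelta(days=1)):
    """Build one row per sample ID, stepping back `spacing` per row from `as_of`."""
    as_of = as_of or datetime.now(timezone.utc)
    return [
        {"sample_id": sid, "event_timestamp": as_of - spacing * (i + 1)}
        for i, sid in enumerate(sample_ids)
    ]


def missing_entity_columns(columns, join_keys):
    """Return required columns (join keys plus event_timestamp) absent from `columns`."""
    required = list(join_keys) + ["event_timestamp"]
    return [name for name in required if name not in columns]


rows = make_entity_rows([f"sample_{i:03d}" for i in range(1, 6)])
print(missing_entity_columns(rows[0].keys(), join_keys=["sample_id"]))  # []
```

Passing the result to `pd.DataFrame(rows)` yields a frame with the same two columns as `sample_entity_df`.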

### Retrieve Historical Features

In [None]:
def get_historical_features(
    store: FeatureStore,
    entity_df: pd.DataFrame,
    features: list[str],
):
    """
    Retrieve historical features from the feature store.
    
    Args:
        store: Feast FeatureStore instance
        entity_df: DataFrame with entity keys and event_timestamp
        features: List of feature references (e.g., "feature_view:feature_name")
    
    Returns:
        DataFrame with historical features joined to entity DataFrame
    """
    print(f"Requesting {len(features)} features for {len(entity_df)} entities...")
    
    try:
        # Get historical features using point-in-time join
        training_df = store.get_historical_features(
            entity_df=entity_df,
            features=features,
        ).to_df()
        
        print(f"Retrieved {len(training_df)} rows with {len(training_df.columns)} columns")
        return training_df
    
    except Exception as e:
        print(f"Error retrieving historical features: {e}")
        raise

In [None]:
# Define features to retrieve
# Format: "feature_view_name:feature_name"

SAMPLE_FEATURES = [
    # Target and metadata
    "malware_sample_features:label",
    "malware_sample_features:malware_family",
    "malware_sample_features:data_source",
    "malware_sample_features:dataset_split",
    # Syscall features (subset for demo)
    "malware_sample_features:syscall_1_normalized",
    "malware_sample_features:syscall_2_normalized",
    "malware_sample_features:syscall_3_normalized",
    "malware_sample_features:syscall_4_normalized",
    "malware_sample_features:syscall_5_normalized",
    # Aggregated features
    "malware_sample_features:syscall_total",
    "malware_sample_features:syscall_mean",
]

print("Features to retrieve:")
for f in SAMPLE_FEATURES:
    print(f"  - {f}")
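
Since every reference follows the `feature_view_name:feature_name` format, a quick local check can catch typos and duplicates before a request reaches the offline store. This is a small sketch using hypothetical helpers `group_feature_refs` and `duplicate_refs`; it only validates the string format and does not confirm that a feature exists in the registry.

```python
from collections import Counter, defaultdict


def group_feature_refs(feature_refs):
    """Group "view:feature" references into {view_name: [feature_name, ...]}."""
    grouped = defaultdict(list)
    for ref in feature_refs:
        view, sep, feature = ref.partition(":")
        if not (sep and view and feature):
            raise ValueError(f"Malformed feature reference: {ref!r}")
        grouped[view].append(feature)
    return dict(grouped)


def duplicate_refs(feature_refs):
    """Return references that appear more than once, in first-seen order."""
    counts = Counter(feature_refs)
    return [ref for ref, n in counts.items() if n > 1]


refs = [
    "malware_sample_features:label",
    "malware_sample_features:syscall_total",
    "malware_family_features:total_samples",
]
grouped = group_feature_refs(refs)
for view, names in grouped.items():
    print(f"{view}: {names}")
```

Calling `group_feature_refs(SAMPLE_FEATURES)` on the list defined above should return a single key, `malware_sample_features`, with eleven feature names.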

In [None]:
# Retrieve historical features
# Note: This requires the Spark offline store to be running
# and LakeFS/Iceberg tables to be populated

try:
    training_df = get_historical_features(
        store=store,
        entity_df=sample_entity_df,
        features=SAMPLE_FEATURES,
    )
    
    print("\nHistorical features retrieved successfully:")
    display(training_df.head())
    
except Exception as e:
    print("\nNote: Historical feature retrieval requires:")
    print("  1. LakeFS with Iceberg tables populated")
    print("  2. Spark offline store configured and running")
    print(f"\nError: {e}")

### Retrieve All Syscall Features

In [None]:
# Build full feature list with all syscall features

ALL_SAMPLE_FEATURES = [
    # Target and metadata
    "malware_sample_features:label",
    "malware_sample_features:malware_family",
    "malware_sample_features:first_seen_year",
    "malware_sample_features:data_source",
    "malware_sample_features:dataset_split",
]

# Add all 20 syscall features
for i in range(1, 21):
    ALL_SAMPLE_FEATURES.append(f"malware_sample_features:syscall_{i}_normalized")

# Add aggregated features
ALL_SAMPLE_FEATURES.extend([
    "malware_sample_features:syscall_total",
    "malware_sample_features:syscall_mean",
])

print(f"Total features: {len(ALL_SAMPLE_FEATURES)}")

### Get Features with Derived Features (On-Demand)

In [None]:
# Include on-demand derived features
# These are computed at retrieval time from the base features

FEATURES_WITH_DERIVED = SAMPLE_FEATURES + [
    "malware_derived_features:syscall_variance",
    "malware_derived_features:is_high_activity",
]

try:
    training_with_derived = get_historical_features(
        store=store,
        entity_df=sample_entity_df,
        features=FEATURES_WITH_DERIVED,
    )
    
    print("\nFeatures with derived columns:")
    display(training_with_derived.head())
    
except Exception as e:
    print("Note: On-demand features require base features to be available.")
    print(f"Error: {e}")

---
## Get Family Features

In [None]:
# Inspect family feature view
inspect_feature_view(store, "malware_family_features")

In [None]:
# Create entity DataFrame for family features
family_entity_df = pd.DataFrame({
    "family_id": [
        "benign",
        "adware",
        "banking",
        "ransomware",
    ],
    "event_timestamp": [datetime.now()] * 4,
})

FAMILY_FEATURES = [
    "malware_family_features:family_name",
    "malware_family_features:is_malware_family",
    "malware_family_features:total_samples",
    "malware_family_features:unique_samples",
    "malware_family_features:emulator_count",
    "malware_family_features:real_device_count",
    "malware_family_features:earliest_year",
    "malware_family_features:latest_year",
]

try:
    family_df = get_historical_features(
        store=store,
        entity_df=family_entity_df,
        features=FAMILY_FEATURES,
    )
    
    print("\nFamily features:")
    display(family_df)
    
except Exception as e:
    print(f"Error: {e}")

---
## Feature Store Statistics

In [None]:
def print_feature_store_stats(store: FeatureStore):
    """Print summary statistics about the feature store."""
    entities = store.list_entities()
    feature_views = store.list_feature_views()
    on_demand_fvs = store.list_on_demand_feature_views()
    data_sources = store.list_data_sources()
    
    total_features = sum(
        len(fv.schema) if fv.schema else 0 
        for fv in feature_views
    )
    
    print(f"Feature Store: {store.project}")
    print("=" * 40)
    print(f"Entities:              {len(entities)}")
    print(f"Feature Views:         {len(feature_views)}")
    print(f"On-Demand Views:       {len(on_demand_fvs)}")
    print(f"Data Sources:          {len(data_sources)}")
    print(f"Total Features:        {total_features}")
    
    # List data sources
    print("\nData Sources:")
    for ds in data_sources:
        print(f"  - {ds.name}")

print_feature_store_stats(store)

---
## Example: Prepare Training Dataset

In [None]:
def prepare_training_dataset(
    store: FeatureStore,
    entity_df: pd.DataFrame,
    feature_refs: list[str],
    label_column: str = "label",
):
    """
    Prepare a training dataset from historical features.
    
    Args:
        store: Feast FeatureStore instance
        entity_df: DataFrame with entity keys and timestamps
        feature_refs: List of feature references
        label_column: Name of the label column
    
    Returns:
        Tuple of (X, y) for model training
    """
    # Get historical features
    df = store.get_historical_features(
        entity_df=entity_df,
        features=feature_refs,
    ).to_df()
    
    # Separate features and labels
    feature_columns = [
        col for col in df.columns 
        if col not in [label_column, "event_timestamp"] 
        and not col.endswith("_id")
    ]
    
    # Select only numeric columns for features
    numeric_features = df[feature_columns].select_dtypes(include=["number"]).columns.tolist()
    
    X = df[numeric_features]
    y = df[label_column] if label_column in df.columns else None
    
    print(f"Training dataset shape: X={X.shape}, y={y.shape if y is not None else 'N/A'}")
    print(f"Feature columns: {numeric_features}")
    
    return X, y

# Example usage (uncomment when data is available)
# X_train, y_train = prepare_training_dataset(
#     store=store,
#     entity_df=sample_entity_df,
#     feature_refs=SAMPLE_FEATURES,
#     label_column="label",
# )

---
## Apply Registry Changes

If you've made changes to the feature definitions, apply them to the registry.

In [None]:
# Apply any pending changes to the registry
# This syncs feature definitions from Python files to the registry

# Uncomment to apply changes:
# store.apply([
#     # Add entities and feature views here
# ])

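# Or apply all objects from the repo. The simplest route is the CLI,
# run from the feature repo directory:
#   feast apply
# Programmatic alternative (check the signature in your Feast version):
# from feast.repo_operations import apply_total
# apply_total(store.config, store.repo_path, skip_source_validation=False)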

---
## Summary

This notebook demonstrated:

1. **Connecting to Feast** - Initialize the FeatureStore with the repo path
2. **Exploring the registry** - List entities, feature views, and schemas
3. **Getting historical features** - Use `get_historical_features()` with entity DataFrames
4. **On-demand features** - Retrieve computed features at query time
5. **Preparing training data** - Extract features and labels for ML models

### Key Points

- Historical features use **point-in-time joins** to prevent data leakage
- Entity DataFrames must include **entity keys** and **event_timestamp**
- Feature references follow the format: `feature_view_name:feature_name`
- On-demand features are computed at retrieval time from source features

### Next Steps

- Populate the Iceberg tables in LakeFS with training data
- Materialize features to the online store for real-time serving
- Use `get_online_features()` for low-latency inference
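
The last two steps can be sketched as follows, assuming the online store is reachable and the offline tables are populated. `materialize_incremental` and `get_online_features` are Feast `FeatureStore` methods; the wrapper functions are hypothetical helpers, and the `store` argument is expected to be the `FeatureStore` instance created earlier in this notebook.

```python
from datetime import datetime, timezone


def build_entity_rows(sample_ids):
    """Shape sample IDs as the list of dicts passed to get_online_features."""
    return [{"sample_id": sample_id} for sample_id in sample_ids]


def refresh_online_store(store):
    """Load feature values up to the current time into the online store."""
    store.materialize_incremental(end_date=datetime.now(timezone.utc))


def fetch_online_features(store, sample_ids, features):
    """Read the latest feature values for the given samples from the online store."""
    response = store.get_online_features(
        features=features,
        entity_rows=build_entity_rows(sample_ids),
    )
    return response.to_dict()


# The entity rows can be inspected without a running store.
print(build_entity_rows(["sample_001", "sample_002"]))
```

A typical call would be `refresh_online_store(store)` followed by `fetch_online_features(store, ["sample_001"], ["malware_sample_features:syscall_total"])`. Feature views registered with `online=False`, such as `malware_batch_features` in the listing earlier, are presumably not available from the online store.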