# Spark & Feast Testing Notebook

This notebook provides a complete environment for testing the integration between **Spark**, **LakeFS (Iceberg)**, and **Feast**.

## Objectives:
1.  **Initialize Spark** with Iceberg and S3A (LakeFS) configurations.
2.  **Explore Data** in LakeFS Iceberg tables.
3.  **Interact with Feast** to manage and retrieve features.
4.  **Validate Integration** between the Spark offline store and Feast.

## 1. Setup & Configuration

First, we'll set the environment variables and import the libraries used throughout the notebook.

In [None]:
import os
import sys
import pandas as pd
from datetime import datetime, timedelta

# Set environment variables for LakeFS and Feast if they aren't already set
# (setdefault keeps any value that is already present in the environment).
# Update these values to match your local environment if necessary
os.environ.setdefault("LAKEFS_ENDPOINT_URL", "http://localhost:8000")
os.environ.setdefault("LAKEFS_ACCESS_KEY_ID", "AKIAJWAE4BUBMLQESYDQ")
os.environ.setdefault("LAKEFS_SECRET_ACCESS_KEY", "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")
os.environ.setdefault("LAKEFS_REPOSITORY", "kronodroid")
os.environ.setdefault("LAKEFS_BRANCH", "main")
os.environ.setdefault("REDIS_CONNECTION_STRING", "redis://localhost:6379")

print(f"LakeFS Repository: {os.environ['LAKEFS_REPOSITORY']}")
print(f"LakeFS Branch:     {os.environ['LAKEFS_BRANCH']}")
print(f"LakeFS Endpoint:   {os.environ['LAKEFS_ENDPOINT_URL']}")
print(f"Redis Connection:  {os.environ['REDIS_CONNECTION_STRING']}")
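
Before moving on, it can help to confirm that every setting is present without echoing secrets into the notebook output. The following is a minimal sketch of such a check; `mask_secret` and `missing_settings` are hypothetical helper names introduced here for illustration, not part of LakeFS or Feast.

```python
import os

# Settings this notebook reads; the list mirrors the variables set above.
REQUIRED_SETTINGS = [
    "LAKEFS_ENDPOINT_URL",
    "LAKEFS_ACCESS_KEY_ID",
    "LAKEFS_SECRET_ACCESS_KEY",
    "LAKEFS_REPOSITORY",
    "LAKEFS_BRANCH",
    "REDIS_CONNECTION_STRING",
]

def mask_secret(value, visible=4):
    """Show only the first `visible` characters of a secret."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)

def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

missing = missing_settings(os.environ)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
else:
    print(f"Access key: {mask_secret(os.environ['LAKEFS_ACCESS_KEY_ID'])}")
```

Printing a masked prefix is usually enough to tell which credential is in use while keeping the full value out of saved notebook output.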

## 2. Initialize Spark

We'll initialize a Spark session configured to use the LakeFS S3 gateway and the Iceberg catalog.

In [None]:
from pyspark.sql import SparkSession

repo = os.environ["LAKEFS_REPOSITORY"]
branch = os.environ["LAKEFS_BRANCH"]
endpoint = os.environ["LAKEFS_ENDPOINT_URL"]
access_key = os.environ["LAKEFS_ACCESS_KEY_ID"]
secret_key = os.environ["LAKEFS_SECRET_ACCESS_KEY"]

spark = (SparkSession.builder
    .appName("Spark Feast Test")
    # Iceberg extensions
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # LakeFS Iceberg catalog (Hadoop-based)
    .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakefs.type", "hadoop")
    .config("spark.sql.catalog.lakefs.warehouse", f"s3a://{repo}/{branch}/iceberg")
    # S3A filesystem for LakeFS S3 gateway
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    # Per-bucket config for LakeFS repository
    .config(f"spark.hadoop.fs.s3a.bucket.{repo}.endpoint", endpoint)
    .config(f"spark.hadoop.fs.s3a.bucket.{repo}.access.key", access_key)
    .config(f"spark.hadoop.fs.s3a.bucket.{repo}.secret.key", secret_key)
    # Maven packages for Iceberg + S3A (compatible with Spark 3.5.0)
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262")
    .getOrCreate())

print("✅ Spark session initialized")
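
The builder chain above mixes three groups of settings: the Iceberg catalog, the global S3A options, and the per-bucket options keyed by repository name. As a rough sketch, the same settings can be assembled as a plain dictionary, which makes them easy to inspect before a session is created. `build_lakefs_spark_conf` is a hypothetical helper, and deriving the SSL flag from the endpoint scheme is an assumption of this sketch rather than something the notebook does.

```python
from urllib.parse import urlparse

def build_lakefs_spark_conf(repo, branch, endpoint, access_key, secret_key):
    """Return the Iceberg catalog and S3A settings as a plain dict."""
    # Assumption: enable SSL only when the endpoint URL uses https.
    use_ssl = urlparse(endpoint).scheme == "https"
    bucket_prefix = f"spark.hadoop.fs.s3a.bucket.{repo}"
    return {
        "spark.sql.catalog.lakefs": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.lakefs.type": "hadoop",
        "spark.sql.catalog.lakefs.warehouse": f"s3a://{repo}/{branch}/iceberg",
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
        f"{bucket_prefix}.endpoint": endpoint,
        f"{bucket_prefix}.access.key": access_key,
        f"{bucket_prefix}.secret.key": secret_key,
    }

# Placeholder credentials, used only to show the resulting keys.
conf = build_lakefs_spark_conf(
    "kronodroid", "main", "http://localhost:8000", "key", "secret"
)
for name in sorted(conf):
    if not name.endswith(".key"):  # skip credentials when printing
        print(f"{name} = {conf[name]}")
```

Each pair can then be applied with `builder.config(name, value)` in a loop before calling `getOrCreate()`.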

## 3. Explore LakeFS Iceberg Tables

Let's see what tables are available in the LakeFS catalog.

In [None]:
# List databases in lakefs catalog
spark.sql("SHOW NAMESPACES IN lakefs").show()

# List tables in kronodroid database (if it exists)
try:
    spark.sql("SHOW TABLES IN lakefs.kronodroid").show()
except Exception as e:
    print(f"Could not list tables in lakefs.kronodroid: {e}")

In [None]:
# Sample data from fct_training_dataset if it exists
try:
    df = spark.table("lakefs.kronodroid.fct_training_dataset")
    print(f"Total records in fct_training_dataset: {df.count()}")
    # Render the first rows as a pandas DataFrame (display() is built into IPython/Jupyter)
    display(df.limit(5).toPandas())
except Exception as e:
    print(f"Could not read lakefs.kronodroid.fct_training_dataset: {e}")

## 4. Feast Integration

Now we'll initialize the Feast Feature Store and verify that it can read data from Spark.

In [None]:
from feast import FeatureStore
import os

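# Directory containing the Feast config. By default FeatureStore loads feature_store.yaml
# from repo_path; in recent Feast versions a differently named file such as
# feature_store_spark.yaml can be passed via the fs_yaml_file argument instead.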
repo_path = os.path.join(os.getcwd(), "../feature_stores/feast_store")
fs = FeatureStore(repo_path=repo_path)

print(f"✅ Feast FeatureStore initialized from: {repo_path}")
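
Because the repo path is built relative to the current working directory, a notebook started from a different folder will point Feast at the wrong place. A minimal sketch of a pre-flight check follows; `resolve_feast_repo` is a hypothetical helper, and it assumes the default config file name `feature_store.yaml`.

```python
from pathlib import Path

def resolve_feast_repo(relative_path, base=None):
    """Resolve a Feast repo directory and confirm it holds feature_store.yaml."""
    base = Path(base) if base is not None else Path.cwd()
    repo_dir = (base / relative_path).resolve()
    config_file = repo_dir / "feature_store.yaml"
    if not config_file.is_file():
        raise FileNotFoundError(f"No feature_store.yaml found in {repo_dir}")
    return repo_dir

try:
    print(resolve_feast_repo("../feature_stores/feast_store"))
except FileNotFoundError as err:
    print(f"Feast repo check failed: {err}")
```

The resolved path can be passed to `FeatureStore(repo_path=str(...))`, and the error message shows the absolute directory that was actually checked.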

### 4.1. List Feature Views & Entities

Let's see what features are defined in Feast.

In [None]:
print("Entities:")
for entity in fs.list_entities():
    print(f"  - {entity.name}")

print("\nFeature Views:")
for fv in fs.list_feature_views():
    print(f"  - {fv.name}")
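
The retrieval calls later in this notebook identify features with `feature_view:feature` strings. As a small sketch, those references can be generated from a view name and a list of feature names rather than typed by hand; `feature_refs` is a hypothetical helper, and the feature names are the ones used in this notebook.

```python
def feature_refs(view_name, feature_names):
    """Build 'view:feature' reference strings for one feature view."""
    return [f"{view_name}:{name}" for name in feature_names]

refs = feature_refs(
    "malware_sample_features",
    ["label", "syscall_total", "syscall_mean"],
)
print(refs)
```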

### 4.2. Retrieval: Historical Features (Offline Store)

We'll retrieve historical features using Spark as the offline store.

In [None]:
import pandas as pd
from datetime import datetime

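# Define an entity dataframe: one row per entity key, plus the event_timestamp
# at which features should be looked up (Feast performs a point-in-time join).
# In a real scenario, this would be your training data with its own timestamps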
entity_df = pd.DataFrame.from_dict({
    "sample_hash": [
        "0000ed700543e4776114ebca3eb0df04",
        "00018f2f4c39c4a56c4d8ce7b30cd0f9",
        "0002ba7c18001d9f829f0ce645c9df5e"
    ],
    "event_timestamp": [
        datetime.now(),
        datetime.now(),
        datetime.now()
    ]
})

try:
    # Get historical features
    retrieval_job = fs.get_historical_features(
        entity_df=entity_df,
        features=[
            "malware_sample_features:label",
            "malware_sample_features:syscall_total",
            "malware_sample_features:syscall_mean",
        ],
    )
    
    # Convert to pandas to see results
    result_df = retrieval_job.to_df()
    # Use display() if in a notebook environment, else print
    if 'IPython' in sys.modules:
        from IPython.display import display
        display(result_df)
    else:
        print(result_df)
except Exception as e:
    print(f"❌ Failed to get historical features: {e}")
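
Historical retrieval is a point-in-time join: for each entity row, the offline store returns the newest feature values recorded at or before that row's `event_timestamp`, ignoring values older than the feature view's TTL. The sketch below illustrates that rule in plain Python. It is a conceptual illustration with made-up data, not Feast's implementation, and `point_in_time_lookup` is a hypothetical name.

```python
from datetime import datetime, timedelta

def point_in_time_lookup(feature_rows, entity_key, as_of, ttl=None):
    """Return the newest feature row for `entity_key` at or before `as_of`.

    feature_rows: list of dicts with "sample_hash", "event_timestamp", and values.
    ttl: optional timedelta; rows older than `as_of - ttl` are ignored.
    """
    candidates = [
        row for row in feature_rows
        if row["sample_hash"] == entity_key
        and row["event_timestamp"] <= as_of
        and (ttl is None or row["event_timestamp"] >= as_of - ttl)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["event_timestamp"])

# Hypothetical feature history for a single sample.
history = [
    {"sample_hash": "a", "event_timestamp": datetime(2024, 1, 1), "syscall_total": 10},
    {"sample_hash": "a", "event_timestamp": datetime(2024, 2, 1), "syscall_total": 25},
    {"sample_hash": "a", "event_timestamp": datetime(2024, 3, 1), "syscall_total": 40},
]

row = point_in_time_lookup(history, "a", as_of=datetime(2024, 2, 15))
print(row["syscall_total"])  # 25: the March value is in the future relative to as_of

stale = point_in_time_lookup(
    history, "a", as_of=datetime(2024, 2, 15), ttl=timedelta(days=7)
)
print(stale)  # None: the newest eligible row is older than the 7-day TTL
```

With `datetime.now()` as the entity timestamp, as in the cell above, the join therefore returns the most recent values available, provided they fall within the TTL.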

### 4.3. Retrieval: Online Features (Online Store)

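Online retrieval reads from the online store, so features must be materialized there first. Until they are, lookups return `None` for the requested features.
Once materialized, they can be retrieved for low-latency serving.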

In [None]:
# Example: materializing latest features (uncomment to run)
# fs.materialize_incremental(end_date=datetime.now())

In [None]:
try:
    # Get online features
    online_features = fs.get_online_features(
        features=[
            "malware_sample_features:label",
            "malware_sample_features:syscall_total",
        ],
        entity_rows=[
            {"sample_hash": "0000ed700543e4776114ebca3eb0df04"}
        ],
    )
    
    print("Online Features Result:")
    print(online_features.to_dict())
except Exception as e:
    print(f"❌ Failed to get online features: {e}")
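
When features have not been materialized, the online lookup still succeeds but the values come back as `None`, which is easy to miss in a printed dictionary. The following sketch scans a result for such gaps. It assumes the dictionary-of-lists shape returned by `to_dict()`, uses a made-up response, and `missing_online_features` is a hypothetical helper.

```python
def missing_online_features(result, entity_key="sample_hash"):
    """Map each entity value to the feature names whose value is None."""
    entities = result[entity_key]
    missing = {}
    for index, entity in enumerate(entities):
        absent = [
            name for name, values in result.items()
            if name != entity_key and values[index] is None
        ]
        if absent:
            missing[entity] = absent
    return missing

# Hypothetical response shaped like online_features.to_dict()
sample_result = {
    "sample_hash": ["0000ed700543e4776114ebca3eb0df04"],
    "label": [None],
    "syscall_total": [None],
}
print(missing_online_features(sample_result))
```

A non-empty result here usually means the materialization step above has not been run for that entity.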

## 5. Cleanup

Stop the Spark session when finished.

In [None]:
spark.stop()
print("✅ Spark session stopped")