# Comprehensive Feast Feature Store Tutorial

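This notebook demonstrates how to use Feast to create and manage a feature store for machine learning. Although the repository created below is named `ride_sharing_features`, the worked example uses a public diabetes dataset and shows how to define, retrieve, and serve patient features.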

## What is Feast?
Feast is an open-source feature store that helps organizations manage and serve machine learning features to production models. It provides:
- Consistent feature definitions
- Point-in-time correct feature retrieval
- Feature sharing and reuse across projects
- Online and offline feature access
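
To make "point-in-time correct" concrete, here is a minimal, dependency-free sketch of the idea: for a given label timestamp, only the latest feature value recorded at or before that timestamp may be used. The function name and the glucose readings are hypothetical, and this is an illustration of the concept rather than Feast's actual implementation.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(history, as_of):
    """Return the latest value whose timestamp is <= as_of, or None if there is none."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical glucose readings for one patient, sorted by timestamp
glucose_history = [
    (datetime(2025, 1, 1), 120.0),
    (datetime(2025, 1, 10), 148.0),
    (datetime(2025, 1, 20), 135.0),
]

# A label observed on Jan 15 must not see the Jan 20 reading
value_at_label_time = point_in_time_lookup(glucose_history, datetime(2025, 1, 15))
print(value_at_label_time)  # 148.0
```

Using the Jan 20 reading for a Jan 15 label would leak future information into training, which is the problem point-in-time joins are meant to prevent.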

In [1]:
# Install required packages
!pip install feast pandas sqlite_vec



In [2]:
import feast
import pandas as pd
import numpy as np
import os
from datetime import datetime, timedelta
from feast import FeatureStore, Entity, FeatureView, ValueType, FeatureService
from feast import FileSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.field import Field
from feast.types import Float32, Int64, String



# Step 1: Set Up a Feature Repository

First, we'll create a new Feast feature repository and set up our project structure.

In [3]:
# Initialize a new feature repository
!feast init ride_sharing_features


Creating a new Feast repository in /home/shamaseen/Desktop/Shai/qafza/free_Training/ride_sharing_features.



In [4]:
# Navigate to the feature repository directory
%cd ride_sharing_features/feature_repo

/home/shamaseen/Desktop/Shai/qafza/free_Training/ride_sharing_features/feature_repo


# Step 2: Create Sample Datasets

Next we load a public diabetes dataset and add the two columns Feast needs: an event timestamp and an entity key (`patient_id`).

In [5]:
# Load and prepare the diabetes dataset
data = pd.read_csv('https://raw.githubusercontent.com/TripathiAshutosh/feast/main/Feast%20Live%20Demo/diabetes.csv')
data['event_timestamp'] = datetime.now()  # Add timestamp column
data['patient_id'] = range(1, len(data) + 1)  # Add patient ID column

In [6]:

# Save the data locally
os.makedirs("./data", exist_ok=True)
data.to_parquet("./data/diabetes_data.parquet")


# Step 3: Define Data Sources and Entities

Now we'll define the data source and the `patient` entity.

In [7]:
# Define the source of our feature data
# FileSource tells Feast where to find the feature data and which column contains the timestamp
diabetes_source = FileSource(
    path="data/diabetes_data.parquet",
    timestamp_field="event_timestamp",  # Column used for point-in-time joins (replaces the deprecated event_timestamp_column)
)

# Define the patient entity
# Entities are the primary keys used to join and retrieve features
patient = Entity(
    name="patient",
    value_type=ValueType.INT64,  # Data type of the entity
    description="Patient ID",    # Description for documentation
    join_keys=["patient_id"]    # Column(s) used to join features
)


# Step 4: Create Feature Views

Let's create a feature view for the patient statistics, plus an on-demand feature view for derived risk features.

In [8]:
# Define the main feature view for diabetes statistics
# Feature views are groups of features that are stored and retrieved together
diabetes_stats_fv = FeatureView(
    name="diabetes_stats",
    entities=[patient],  # Link to the patient entity
    ttl=timedelta(days=365),  # Time-to-live for features (how long they're considered valid)
    schema=[
        # Define each feature's name and data type
        Field(name="Pregnancies", dtype=Int64),
        Field(name="Glucose", dtype=Float32),
        Field(name="BloodPressure", dtype=Float32),
        Field(name="SkinThickness", dtype=Float32),
        Field(name="Insulin", dtype=Float32),
        Field(name="BMI", dtype=Float32),
        Field(name="DiabetesPedigreeFunction", dtype=Float32),
        Field(name="Age", dtype=Int64),
        Field(name="Outcome", dtype=Int64)
    ],
    source=diabetes_source,
    online=True,  # Enable online serving
    description="Diabetes patient statistics and outcomes"
)
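
The `ttl` above bounds how far back a feature row may lie relative to the timestamp being queried. A minimal sketch of that check follows; the helper name is hypothetical and treating both ends of the window as inclusive is an assumption, not a statement about Feast internals.

```python
from datetime import datetime, timedelta


def is_within_ttl(feature_ts, entity_ts, ttl):
    """True if the feature row is not from the future and not older than ttl."""
    return entity_ts - ttl <= feature_ts <= entity_ts


ttl = timedelta(days=365)  # same value as the feature view above
entity_ts = datetime(2025, 1, 22)  # illustrative query timestamp

fresh = is_within_ttl(datetime(2024, 6, 1), entity_ts, ttl)
stale = is_within_ttl(datetime(2023, 6, 1), entity_ts, ttl)
future = is_within_ttl(datetime(2025, 2, 1), entity_ts, ttl)
print(fresh, stale, future)  # True False False
```

Rows that fail this check are treated as missing, so a `ttl` that is too short for the data silently produces empty feature values.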

In [9]:
# Define an on-demand feature view for diabetes risk scoring
@on_demand_feature_view(
    sources=[diabetes_stats_fv],
    schema=[
        Field(name="diabetes_risk_score", dtype=Float32),
        Field(name="high_risk", dtype=Int64)
    ]
)
def diabetes_risk_scoring_view(inputs: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate diabetes risk score and risk flag based on patient features.
    
    Args:
        inputs (pd.DataFrame): Input features from diabetes_stats_fv
    
    Returns:
        pd.DataFrame: Calculated risk scores and flags
    """
    df = pd.DataFrame()
    
    # Calculate weighted risk score based on medical features
    df['diabetes_risk_score'] = (
        (inputs['Glucose'] / 200) * 0.3 +      # Glucose level contribution
        (inputs['BMI'] / 50) * 0.2 +           # BMI contribution
        (inputs['Age'] / 100) * 0.15 +         # Age contribution
        (inputs['DiabetesPedigreeFunction']) * 0.15 +  # Genetic contribution
        (inputs['BloodPressure'] / 200) * 0.1 +        # Blood pressure contribution
        (inputs['Insulin'] / 846) * 0.1        # Insulin level contribution
    ).round(3)
    
    # Flag high-risk patients (score > 0.6)
    df['high_risk'] = (df['diabetes_risk_score'] > 0.6).astype(int)
    
    # Ensure correct data types
    df['diabetes_risk_score'] = df['diabetes_risk_score'].astype('float32')
    df['high_risk'] = df['high_risk'].astype('int64')
    
    return df
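
As a sanity check on the weights, the same formula can be worked by hand for single patients. The sketch below is a plain-Python restatement of the scoring expression (the function name is hypothetical); the input values are taken from the training sample printed later in this notebook.

```python
def diabetes_risk_score(glucose, bmi, age, pedigree, blood_pressure, insulin):
    """Plain-Python version of the weighted score used in the on-demand view."""
    score = (
        (glucose / 200) * 0.3
        + (bmi / 50) * 0.2
        + (age / 100) * 0.15
        + pedigree * 0.15
        + (blood_pressure / 200) * 0.1
        + (insulin / 846) * 0.1
    )
    return round(score, 3)


# Patient 1: Glucose 148, BMI 33.6, Age 50, pedigree 0.627, BloodPressure 72, Insulin 0
score_1 = diabetes_risk_score(148, 33.6, 50, 0.627, 72, 0)
# Patient 5: Glucose 137, BMI 43.1, Age 33, pedigree 2.288, BloodPressure 40, Insulin 168
score_5 = diabetes_risk_score(137, 43.1, 33, 2.288, 40, 168)

print(score_1, int(score_1 > 0.6))  # 0.561 0
print(score_5, int(score_5 > 0.6))  # 0.81 1
```

Note that `DiabetesPedigreeFunction` is the only term that is not normalized, so a large pedigree value (2.288 for patient 5) can dominate the score.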

# Step 5: Create Feature Services

Feature services help organize features into logical groups for different use cases.

In [10]:
# Define a feature service for diabetes monitoring
# Feature services group features for specific use cases
diabetes_monitoring_fs = FeatureService(
    name="diabetes_monitoring",
    features=[
        diabetes_stats_fv,
        diabetes_risk_scoring_view
    ],
    description="Features for monitoring diabetes risk and patient statistics"
)


# Step 6: Apply the Feature Repository

In [11]:
# Initialize and apply the feature store configuration
store = FeatureStore(repo_path=".")
store.apply([
    patient,
    diabetes_stats_fv,
    diabetes_risk_scoring_view,
    diabetes_monitoring_fs
])

print("Feature repository applied successfully!")

Feature repository applied successfully!




# Step 7: Feature Retrieval Examples

Let's demonstrate various ways to retrieve features from our feature store.

In [12]:
# Example: Retrieve features for training
def get_training_features(patient_ids=None):
    """
    Retrieve historical features for model training.
    
    Args:
        patient_ids: List of patient IDs to retrieve features for
                    If None, defaults to first 10 patients
    
    Returns:
        pd.DataFrame: Training dataset with features and timestamps
    """
    if patient_ids is None:
        patient_ids = range(1, 11)
        
    # Create entity dataframe with timestamps
    training_entity_df = pd.DataFrame({
        "patient_id": patient_ids,
        "event_timestamp": [datetime.now() for _ in patient_ids]
    })
    
    training_df = store.get_historical_features(
        entity_df=training_entity_df,
        features=[
            "diabetes_stats:Glucose",
            "diabetes_stats:BMI",
            "diabetes_stats:Age",
            "diabetes_stats:DiabetesPedigreeFunction",  
            "diabetes_stats:BloodPressure",
            "diabetes_stats:Insulin",   
            "diabetes_stats:Outcome",
            "diabetes_risk_scoring_view:diabetes_risk_score",
            "diabetes_risk_scoring_view:high_risk"
        ]
    ).to_df()
    
    return training_df


In [13]:
# Get training features for first 5 patients
print("\nTraining Features Sample:")
get_training_features(range(1, 6))


Training Features Sample:




| | patient_id | event_timestamp | Glucose | BMI | Age | DiabetesPedigreeFunction | BloodPressure | Insulin | Outcome | diabetes_risk_score | high_risk |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2025-01-22 23:10:59.041165+00:00 | 148 | 33.6 | 50 | 0.627 | 72 | 0 | 1 | 0.561 | 0 |
| 1 | 2 | 2025-01-22 23:10:59.041167+00:00 | 85 | 26.6 | 31 | 0.351 | 66 | 0 | 0 | 0.366 | 0 |
| 2 | 3 | 2025-01-22 23:10:59.041167+00:00 | 183 | 23.3 | 32 | 0.672 | 64 | 0 | 1 | 0.548 | 0 |
| 3 | 4 | 2025-01-22 23:10:59.041167+00:00 | 89 | 28.1 | 21 | 0.167 | 66 | 94 | 0 | 0.347 | 0 |
| 4 | 5 | 2025-01-22 23:10:59.041168+00:00 | 137 | 43.1 | 33 | 2.288 | 40 | 168 | 1 | 0.81 | 1 |


In [14]:
# Loading features into the online store
# There are two ways to load features into the online store:
#
# - materialize: loads the latest feature values between two dates.
#     feast materialize 2020-01-01T00:00:00 2022-01-01T00:00:00
#
# - materialize-incremental: loads features up to the provided end date.
#     feast materialize-incremental 2022-01-01T00:00:00

def materialize_features():
    """
    Materialize features to the online store.
    
    This function is crucial for online serving as it:
    1. Takes the latest feature values from the offline store
    2. Transforms them into a format suitable for fast retrieval
    3. Loads them into the online store
    4. Ensures features are available for real-time serving
    
    Without materialization, online feature retrieval will return None values.
    """

    # store.materialize(start_date=datetime.utcnow() - timedelta(days=530), end_date=datetime.utcnow() - timedelta(days=10))
    
    store.materialize_incremental(end_date=datetime.now())
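
The difference between the two commands comes down to how the start of the time window is chosen. The sketch below illustrates one plausible rule, stated as an assumption: `materialize-incremental` resumes from the end of the previous run, and on a first run falls back to `end_date - ttl`. The function name is hypothetical. The first-run case is consistent with the log output later in this notebook, where `diabetes_stats` is materialized from 2024-01-23 to 2025-01-22, exactly 365 days.

```python
from datetime import datetime, timedelta


def incremental_start(end_date, ttl, last_materialized_end=None):
    """Pick the window start: resume from the last run, else look back one ttl."""
    if last_materialized_end is not None:
        return last_materialized_end
    return end_date - ttl


end_date = datetime(2025, 1, 22, 23, 10, 59)
ttl = timedelta(days=365)

# First run: nothing materialized yet, so the window spans one ttl
first_start = incremental_start(end_date, ttl)
print(first_start)  # 2024-01-23 23:10:59

# Later run: only the gap since the previous run is loaded
next_end = end_date + timedelta(days=1)
next_start = incremental_start(next_end, ttl, last_materialized_end=end_date)
print(next_start)  # 2025-01-22 23:10:59
```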

In [15]:
# Example: Get online features for real-time prediction
def get_online_features(patient_id):
    """
    Retrieve real-time features for a single patient.
    
    This function is used for making predictions on new or current patients.
    It requires features to be materialized in the online store.
    
    Args:
        patient_id: ID of the patient to retrieve features for
    
    Returns:
        dict: Dictionary containing the patient's features
    """
    try:
        # Ensure features are materialized
        materialize_features()
        
        # Retrieve online features
        online_features = store.get_online_features(
            entity_rows=[{"patient_id": patient_id}],
            features=[
                "diabetes_stats:Glucose",
                "diabetes_stats:BMI",
                "diabetes_stats:Age",
                "diabetes_stats:DiabetesPedigreeFunction",
                "diabetes_stats:BloodPressure",
                "diabetes_stats:Insulin",
                "diabetes_stats:Outcome",
                "diabetes_risk_scoring_view:diabetes_risk_score",
                "diabetes_risk_scoring_view:high_risk"
            ]
        ).to_dict()
        
        return online_features
    except Exception as e:
        print(f"Error getting online features: {e}")
        return None

    

In [16]:
# Get online features for patient ID 1
# Pass a scalar ID: passing a list such as [1] fails with
# "int() argument must be a string, a bytes-like object or a real number, not 'list'"
print("\nOnline Features Sample:")
print(get_online_features(patient_id=1))


Online Features Sample:
Materializing 1 feature views to 2025-01-22 23:10:59+03:00 into the sqlite online store.

diabetes_stats from 2024-01-23 23:10:59+03:00 to 2025-01-22 23:10:59+03:00:

100%|██████████| 768/768 [00:00<00:00, 10919.37it/s]
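
`get_online_features` takes `entity_rows` as a list of dictionaries, each holding one scalar value per join key. A small helper (hypothetical, not part of Feast) can normalize either a single ID or a collection of IDs into that shape, which avoids accidentally nesting a list inside a row:

```python
def build_entity_rows(patient_ids):
    """Accept a single ID or an iterable of IDs and return Feast-style entity rows."""
    if isinstance(patient_ids, int):
        patient_ids = [patient_ids]
    return [{"patient_id": int(pid)} for pid in patient_ids]


print(build_entity_rows(1))           # [{'patient_id': 1}]
print(build_entity_rows([767, 766]))  # [{'patient_id': 767}, {'patient_id': 766}]
```

The result can be passed directly as `entity_rows=build_entity_rows(...)`.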





# Step 8: Clean Up and Best Practices

In [17]:
def check_feature_store_health():
    """
    Check the health and status of the feature store.
    
    This function provides visibility into:
    1. Available feature views
    2. Configured feature services
    3. Registered entities
    4. Data sources
    
    Useful for debugging and monitoring the feature store setup.
    """
    health_status = {
        "feature_views": store.list_feature_views(),
        "feature_services": store.list_feature_services(),
        "entities": store.list_entities(),
        "data_sources": [source.path for view in store.list_feature_views() 
                        for source in [view.batch_source]]
    }
    
    print("Feature Store Health Check:")
    for component, items in health_status.items():
        print(f"\n{component.title()}:")
        for item in items:
            print(f"  - {item}")


In [18]:
check_feature_store_health()

Feature Store Health Check:

Feature_Views:
  - {
  "spec": {
    "name": "diabetes_stats",
    "entities": [
      "patient"
    ],
    "features": [
      {
        "name": "Pregnancies",
        "valueType": "INT64"
      },
      {
        "name": "Glucose",
        "valueType": "FLOAT"
      },
      {
        "name": "BloodPressure",
        "valueType": "FLOAT"
      },
      {
        "name": "SkinThickness",
        "valueType": "FLOAT"
      },
      {
        "name": "Insulin",
        "valueType": "FLOAT"
      },
      {
        "name": "BMI",
        "valueType": "FLOAT"
      },
      {
        "name": "DiabetesPedigreeFunction",
        "valueType": "FLOAT"
      },
      {
        "name": "Age",
        "valueType": "INT64"
      },
      {
        "name": "Outcome",
        "valueType": "INT64"
      }
    ],
    "ttl": "31536000s",
    "batchSource": {
      "type": "BATCH_FILE",
      "timestampField": "event_timestamp",
      "fileOptions": {
        "uri": "data/d

# Step 9: Training

In [40]:
from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path='.')

# Use every patient and its event timestamp as the entity dataframe
entity_df = pd.read_parquet(path="data/diabetes_data.parquet")[["patient_id", "event_timestamp"]]

# Build the historical retrieval job (lazy until converted or saved)
training_data = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "diabetes_stats:Pregnancies",
        "diabetes_stats:Glucose",
        "diabetes_stats:BloodPressure",
        "diabetes_stats:SkinThickness",
        "diabetes_stats:Insulin",
        "diabetes_stats:BMI",
        "diabetes_stats:DiabetesPedigreeFunction",
        "diabetes_stats:Age",
        "diabetes_stats:Outcome",
    ],
)

# Persist the retrieved features as a named, reusable dataset
dataset = store.create_saved_dataset(
    from_=training_data,
    name="diabetes_dataset",
    storage=SavedDatasetFileStorage("data/diabetes_dataset1.parquet"),
)



In [41]:
pd.read_parquet("data/diabetes_dataset1.parquet")



| | patient_id | event_timestamp | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2025-01-22 23:10:58.353939+00:00 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 507 | 2025-01-22 23:10:58.353939+00:00 | 0 | 180 | 90 | 26 | 90 | 36.5 | 0.314 | 35 | 1 |
| 2 | 508 | 2025-01-22 23:10:58.353939+00:00 | 1 | 130 | 60 | 23 | 170 | 28.6 | 0.692 | 21 | 0 |
| 3 | 509 | 2025-01-22 23:10:58.353939+00:00 | 2 | 84 | 50 | 23 | 76 | 30.4 | 0.968 | 21 | 0 |
| 4 | 510 | 2025-01-22 23:10:58.353939+00:00 | 8 | 120 | 78 | 0 | 0 | 25.0 | 0.409 | 64 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 260 | 2025-01-22 23:10:58.353939+00:00 | 11 | 155 | 76 | 28 | 150 | 33.3 | 1.353 | 51 | 1 |
| 764 | 261 | 2025-01-22 23:10:58.353939+00:00 | 3 | 191 | 68 | 15 | 130 | 30.9 | 0.299 | 34 | 0 |
| 765 | 262 | 2025-01-22 23:10:58.353939+00:00 | 3 | 141 | 0 | 0 | 0 | 30.0 | 0.761 | 27 | 1 |
| 766 | 767 | 2025-01-22 23:10:58.353939+00:00 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |


In [42]:
# Importing dependencies
from feast import FeatureStore
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from joblib import dump

# Getting our FeatureStore
store = FeatureStore(repo_path=".")

# Converting the retrieval job from the previous cell to a DataFrame
# (alternative: store.get_saved_dataset(name="diabetes_dataset").to_df())
training_df = training_data.to_df()

# Separating the features and labels
y = training_df['Outcome']
X = training_df.drop(
    labels=['Outcome', 'event_timestamp', "patient_id"], 
    axis=1)

# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    stratify=y)

# Creating and training LogisticRegression
# Columns are sorted by name so the same order can be reproduced at inference time
reg = LogisticRegression(max_iter=200)
reg.fit(X=X_train[sorted(X_train)], y=y_train)

# Saving the model
dump(value=reg, filename="model.joblib")





['model.joblib']
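
The `X_train[sorted(X_train)]` idiom matters because the model only sees positions, not column names, and the online store may return features in a different order than the training frame. The dependency-free sketch below shows why sorting by name gives a stable order; the helper name is hypothetical and the row values are illustrative.

```python
def to_feature_vector(row, feature_names):
    """Order one row's values by sorted feature name."""
    return [row[name] for name in sorted(feature_names)]


# Column order as requested during training
training_columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

# The same features arriving in a different order at serving time
online_row = {
    "SkinThickness": 0.0, "Pregnancies": 1, "BloodPressure": 60.0,
    "Glucose": 126.0, "Age": 47, "Insulin": 0.0, "BMI": 30.1,
    "DiabetesPedigreeFunction": 0.349,
}

vector_from_training_order = to_feature_vector(online_row, training_columns)
vector_from_online_order = to_feature_vector(online_row, online_row.keys())

print(vector_from_training_order)
print(vector_from_training_order == vector_from_online_order)  # True
```

Sorting is case-sensitive, so `BMI` comes before `BloodPressure`; what matters is only that training and inference apply the same rule.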

In [44]:
# Importing dependencies
from feast import FeatureStore
from datetime import datetime, timedelta

# Getting our FeatureStore
store = FeatureStore(repo_path=".")

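#store.materialize_incremental(end_date = datetime.now())

# Note: this window ends 10 days ago, while the dataset rows were stamped with
# datetime.now() when the data was created, so no rows fall inside it (see "0it" below).
# The online store still holds the values loaded by the earlier materialize_incremental call.
store.materialize(start_date=datetime.utcnow() - timedelta(days=530), end_date=datetime.utcnow() - timedelta(days=10))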

Materializing 1 feature views from 2023-08-11 20:18:48+03:00 to 2025-01-12 20:18:48+03:00 into the sqlite online store.

diabetes_stats:


0it [00:00, ?it/s]


In [45]:
# Importing dependencies
from feast import FeatureStore
import pandas as pd
from joblib import load

# Getting our FeatureStore
store = FeatureStore(repo_path=".")

# Defining our features names
feast_features = [
        "diabetes_stats:Pregnancies",
        "diabetes_stats:Glucose",
        "diabetes_stats:BloodPressure",
        "diabetes_stats:SkinThickness",
        "diabetes_stats:Insulin",
        "diabetes_stats:BMI",
        "diabetes_stats:DiabetesPedigreeFunction",
        "diabetes_stats:Age",
    ]

# Getting the latest features
features = store.get_online_features(
    features=feast_features,    
    entity_rows=[{"patient_id": 767}, {"patient_id": 766}]
).to_dict()

# Converting the features to a DataFrame
features_df = pd.DataFrame.from_dict(data=features)

In [46]:
features_df.head()


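| | patient_id | SkinThickness | Pregnancies | BloodPressure | Glucose | Age | Insulin | BMI | DiabetesPedigreeFunction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 767 | 0.0 | 1 | 60.0 | 126.0 | 47 | 0.0 | 30.1 | 0.349 |
| 1 | 766 | 23.0 | 5 | 72.0 | 121.0 | 30 | 112.0 | 26.200001 | 0.245 |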


# Step 10: Call the predict function and see the output


In [47]:
features_df.drop("patient_id", axis=1)

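| | SkinThickness | Pregnancies | BloodPressure | Glucose | Age | Insulin | BMI | DiabetesPedigreeFunction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 1 | 60.0 | 126.0 | 47 | 0.0 | 30.1 | 0.349 |
| 1 | 23.0 | 5 | 72.0 | 121.0 | 30 | 112.0 | 26.200001 | 0.245 |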


In [48]:
# Loading our model and doing inference
reg = load("model.joblib")
# Drop the entity key and sort columns to match the order used in training
X_online = features_df.drop("patient_id", axis=1)
X_online = X_online[sorted(X_online)]
predictions = reg.predict(X_online)
print(predictions)
prediction_probabilities = reg.predict_proba(X_online)
print(prediction_probabilities)

[0 0]
[[0.7426289  0.2573711 ]
 [0.81199663 0.18800337]]
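
To read this output: each row of `predict_proba` holds one probability per class and sums to 1, and for a binary logistic regression the predicted label is the class with the larger probability. The sketch below reproduces that relationship from the printed numbers; the helper name is hypothetical, and the class order `[0, 1]` is an assumption based on the `Outcome` labels being 0 and 1.

```python
# Probabilities copied from the output above: one row per patient
probabilities = [
    [0.7426289, 0.2573711],
    [0.81199663, 0.18800337],
]
classes = [0, 1]  # assumed class order (no diabetes, diabetes)


def predicted_labels(probabilities, classes):
    """Pick, for each row, the class with the highest probability."""
    return [classes[row.index(max(row))] for row in probabilities]


labels = predicted_labels(probabilities, classes)
print(labels)  # [0, 0]
```

Both patients are predicted as class 0, with estimated probabilities of about 26% and 19% for class 1.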
