In [2]:
!pip install feast scikit-learn pandas pyarrow

Collecting feast
  Downloading feast-0.48.0-py2.py3-none-any.whl.metadata (37 kB)
Collecting colorama<1,>=0.3.9 (from feast)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting dill~=0.3.0 (from feast)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting mmh3 (from feast)
  Downloading mmh3-5.1.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting numpy<2,>=1.22 (from feast)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pyarrow
  Downloading pyarrow-17.0.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting pydantic==2.10.6 (from feast)
  Downloading pydantic-2.10.6-py3-none-any.whl.metadata (30 kB)
Collecting tenacity<9,>=7 (from feast)
  Downloading tenacity-8.5.0

In [3]:
!feast init feature_repo



Creating a new Feast repository in /content/feature_repo.



The feast init command creates a directory named feature_repo containing:

feature_store.yaml: the configuration file for your feature store.

Example feature definitions: sample Python files that illustrate how features are defined.

In [5]:
%cd feature_repo


/content/feature_repo/feature_repo


In [6]:
!feast apply

  driver = Entity(name="driver", join_keys=["driver_id"])
Applying changes for project feature_repo
No changes to registry
No changes to infrastructure


In [17]:
# Load and explore the data
import pandas as pd

df = pd.read_parquet('/content/feature_repo/feature_repo/data/driver_stats.parquet')

df.head()

             event_timestamp  driver_id  conv_rate  acc_rate  avg_daily_trips                  created
0  2025-03-24 10:00:00+00:00       1005   0.655618  0.291158              806  2025-04-08 10:24:01.908
1  2025-03-24 11:00:00+00:00       1005   0.849991  0.882946              750  2025-04-08 10:24:01.908
2  2025-03-24 12:00:00+00:00       1005   0.603516  0.368824              946  2025-04-08 10:24:01.908
3  2025-03-24 13:00:00+00:00       1005   0.727313  0.559777              500  2025-04-08 10:24:01.908
4  2025-03-24 14:00:00+00:00       1005    0.57879  0.694617              537  2025-04-08 10:24:01.908


In [18]:
len(df)

1807

Generate a Training Dataset from Feast

In [19]:
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Define entity dataframe
entity_df = df[['driver_id', 'event_timestamp']].drop_duplicates().sample(n=150, random_state=42).copy()

entity_df = entity_df.reset_index(drop=True)


# Retrieve historical features from Feast
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips"
    ],
).to_df()

# Show the training data
training_df.head()


   driver_id            event_timestamp  conv_rate  acc_rate  avg_daily_trips
0       1002  2025-03-24 11:00:00+00:00   0.059025  0.477846              984
1       1001  2025-03-24 13:00:00+00:00   0.182314  0.836814              465
2       1001  2025-03-24 18:00:00+00:00   0.622156  0.589965              555
3       1003  2025-03-25 00:00:00+00:00   0.143273  0.180927              638
4       1003  2025-03-25 08:00:00+00:00   0.390656  0.892241              809
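
To make the point-in-time join behind get_historical_features() more concrete, here is a minimal pure-Python sketch of the idea: for each entity and timestamp, pick the most recent feature row at or before that timestamp. It is not Feast's implementation. The function name point_in_time_lookup, the tuple layout, and the ttl handling are assumptions made for illustration; the sample values are the first three conv_rate rows of driver 1005 shown in df.head().

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone


def point_in_time_lookup(feature_rows, entity_id, as_of, ttl=None):
    """Return the latest value for entity_id with timestamp <= as_of.

    feature_rows: list of (entity_id, timestamp, value) tuples (hypothetical layout).
    ttl: optional timedelta; a match older than as_of - ttl is treated as missing.
    """
    # Keep only this entity's rows, ordered by timestamp.
    rows = sorted(
        ((ts, value) for eid, ts, value in feature_rows if eid == entity_id),
        key=lambda row: row[0],
    )
    timestamps = [ts for ts, _ in rows]
    # bisect_right gives the index of the first row strictly after as_of,
    # so the row just before it is the newest one that does not leak the future.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None
    ts, value = rows[idx - 1]
    if ttl is not None and as_of - ts > ttl:
        return None
    return value


def utc(hour, minute=0):
    """Helper for this example only: a UTC timestamp on 2025-03-24."""
    return datetime(2025, 3, 24, hour, minute, tzinfo=timezone.utc)


feature_rows = [
    (1005, utc(10), 0.655618),
    (1005, utc(11), 0.849991),
    (1005, utc(12), 0.603516),
]

print(point_in_time_lookup(feature_rows, 1005, utc(11, 30)))  # 0.849991
print(point_in_time_lookup(feature_rows, 1005, utc(9)))       # None
print(point_in_time_lookup(feature_rows, 1005, utc(18), ttl=timedelta(hours=2)))  # None
```

The 11:30 lookup returns the 11:00 value rather than the 12:00 one, which is the property that keeps future data out of a training row.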


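Let's create a synthetic label to simulate an ML use case (e.g., predicting driver performance). Because the label is drawn at random, it is unrelated to the features, so the model trained below cannot learn a real signal. The goal is to exercise the pipeline end to end, not to get a useful classifier.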

In [20]:
import numpy as np

# Add a synthetic binary target variable for demonstration
np.random.seed(42)
training_df['high_performance'] = np.random.choice([0, 1], size=len(training_df))

# Features and labels
X = training_df[["conv_rate", "acc_rate", "avg_daily_trips"]]
y = training_df["high_performance"]


In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.43333333333333335
              precision    recall  f1-score   support

           0       0.35      0.50      0.41        12
           1       0.54      0.39      0.45        18

    accuracy                           0.43        30
   macro avg       0.45      0.44      0.43        30
weighted avg       0.46      0.43      0.44        30
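
To put the accuracy in context, it helps to compare it with a majority-class baseline. The short calculation below uses only the support counts and the accuracy printed in the report above; the variable names are illustrative. Since the labels were drawn with np.random.choice, a score at or below the baseline is the expected outcome here.

```python
# Class supports and accuracy copied from the classification report above.
support = {0: 12, 1: 18}
model_accuracy = 0.43333333333333335

total = sum(support.values())
# A baseline that always predicts the most frequent class in the test split.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")                             # 1
print(f"Baseline accuracy: {baseline_accuracy:.2f}")                   # 0.60
print(f"Model beats baseline: {model_accuracy > baseline_accuracy}")   # False
```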



Materialize to Online Store

In [22]:
!feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")


Materializing 2 feature views to 2025-04-08 12:26:27+00:00 into the sqlite online store.

driver_hourly_stats from 2025-04-07 12:26:33+00:00 to 2025-04-08 12:26:27+00:00:
100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 375.34it/s]
driver_hourly_stats_fresh from 2025-04-07 12:26:33+00:00 to 2025-04-08 12:26:27+00:00:
100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 581.09it/s]


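This tells Feast to take all feature data that is new since the last materialization point and load it into the online store, up to the current UTC timestamp. When no earlier materialization has been recorded, Feast derives the start of the window from each feature view's TTL; the one-day window in the log above is consistent with that.

Running the command on a schedule keeps the online store up to date, which matters most for features that are recomputed periodically (e.g., hourly or daily aggregates).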
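
The idea of incremental materialization can be sketched in a few lines of plain Python. This is a simplified model, not Feast's code: the function materialize_window, the tuple layout, the dict used as an online store, and the half-open window (start, end] are all assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def materialize_window(offline_rows, online_store, start, end):
    """Copy the newest row per entity with start < timestamp <= end.

    offline_rows: iterable of (entity_id, timestamp, features) tuples (hypothetical).
    online_store: dict mapping entity_id -> (timestamp, features).
    Returns end, which the caller keeps as the start of the next window.
    """
    for entity_id, ts, features in offline_rows:
        if not (start < ts <= end):
            continue  # outside this materialization window
        current = online_store.get(entity_id)
        # Only overwrite when the incoming row is newer than what is stored.
        if current is None or ts > current[0]:
            online_store[entity_id] = (ts, features)
    return end


def utc(hour):
    """Helper for this example only: a UTC timestamp on 2025-03-24."""
    return datetime(2025, 3, 24, hour, tzinfo=timezone.utc)


offline_rows = [
    (1001, utc(10), {"conv_rate": 0.1}),
    (1001, utc(11), {"conv_rate": 0.2}),
    (1002, utc(11), {"conv_rate": 0.3}),
    (1001, utc(13), {"conv_rate": 0.4}),
]
online_store = {}

# First run: everything up to 12:00 is loaded.
last_end = materialize_window(offline_rows, online_store, start=utc(9), end=utc(12))
print(online_store[1001][1])  # {'conv_rate': 0.2}

# Second run: only rows after the previous end point are considered.
last_end = materialize_window(offline_rows, online_store, start=last_end, end=utc(14))
print(online_store[1001][1])  # {'conv_rate': 0.4}
print(online_store[1002][1])  # {'conv_rate': 0.3}
```

Carrying the previous end point forward is what makes the second run cheap: rows that were already loaded are skipped instead of being rewritten.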

In [24]:
# Retrieve latest features for a driver
online_features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips"
    ],
    entity_rows=[{"driver_id": int(entity_df.iloc[0]['driver_id'])}]
).to_dict()

# Prepare for prediction
online_X = pd.DataFrame.from_dict(online_features).drop(columns=["driver_id"])
expected_columns = ["conv_rate", "acc_rate", "avg_daily_trips"]
online_X = online_X[expected_columns]

# Predict
real_time_prediction = model.predict(online_X)

print("Real-time feature vector:\n", online_X)
print("Predicted High Performance?", real_time_prediction[0])


Real-time feature vector:
    conv_rate  acc_rate  avg_daily_trips
0   0.974777  0.451735              500
Predicted High Performance? 1
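
One practical caveat: when an entity has no materialized row, the online store returns None for its features, so it can be worth validating the response before calling model.predict(). The helper below is a hypothetical sketch that works on a plain dict shaped like the to_dict() result above (column name to list of values). The name to_feature_rows is not part of Feast, the feature values are copied from the printed vector, and the driver_id value is illustrative.

```python
def to_feature_rows(online_features, expected_columns):
    """Turn a dict of column -> list of values into rows in a fixed column order.

    Raises ValueError if a column is absent or any value is None.
    """
    missing_cols = [c for c in expected_columns if c not in online_features]
    if missing_cols:
        raise ValueError(f"missing feature columns: {missing_cols}")
    n_rows = len(online_features[expected_columns[0]])
    rows = []
    for i in range(n_rows):
        # Build the row in the same column order the model was trained on.
        row = [online_features[c][i] for c in expected_columns]
        if any(value is None for value in row):
            raise ValueError(f"row {i} has missing feature values")
        rows.append(row)
    return rows


expected_columns = ["conv_rate", "acc_rate", "avg_daily_trips"]

# Feature values copied from the printed vector; driver_id is illustrative.
online_features = {
    "driver_id": [1001],
    "conv_rate": [0.974777],
    "acc_rate": [0.451735],
    "avg_daily_trips": [500],
}

print(to_feature_rows(online_features, expected_columns))
# [[0.974777, 0.451735, 500]]
```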


With Feast:
You registered features once using feast apply.

You retrieved training data with automatic point-in-time joins using get_historical_features().

You served the latest feature values for real-time inference using get_online_features().

You didn't have to build separate feature pipelines for training and inference; the same feature definitions backed both.

Feast gives you a standardized way to define, store, and serve ML features. Because training and serving read from the same definitions, it cuts down on duplicated feature code and on the risk of training/serving skew. That matters most in production ML systems, where data freshness, feature consistency, and collaboration are critical.

This is why Feast is often considered a core part of the modern MLOps stack.