<a href="https://colab.research.google.com/github/feast-dev/feast-driver-ranking-tutorial/blob/master/notebooks/Driver_Ranking_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Overview
Predicting an outcome with a linear regression model is a common ML use case. In this tutorial, we build a model that predicts whether a driver will complete a trip, based on features ingested into Feast.

The basic local mode lets you try Feast quickly, while the advanced mode shows how to use Feast in a production setting, specifically on Google Cloud Platform (GCP).

This tutorial uses Feast with scikit-learn to:

* Train a model locally using data from BigQuery
* Test the model for online inference using SQLite (for fast iteration)
* Test the model for online inference using Firestore (to represent production)
 

## Step 1: Install Feast and scikit-learn

Install Feast, its GCP dependencies, and scikit-learn.


In [1]:
!pip install feast scikit-learn 'feast[gcp]'

Collecting scikit-learn
  Downloading scikit_learn-1.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Collecting scipy>=1.1.0
  Downloading scipy-1.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.6 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Collecting joblib>=0.11
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Collecting google-cloud-bigquery-storage>=2.0.0
  Downloading google_cloud_bigquery_storage-2.12.0-py2.py3-none-any.whl (179 kB)
Collecting google-cloud-core<2.0.0,>=1.4.0
  Downloading google_cloud_core-1.7.2-py2.py3-none-any.whl (28 kB)
Collecting google-cloud-storage<1.41,>=1.

Collecting google-resumable-media<3.0dev,>=0.6.0
  Using cached google_resumable_media-2.3.1-py2.py3-none-any.whl (76 kB)
Collecting grpcio-status<2.0dev,>=1.33.2
  Downloading grpcio_status-1.44.0-py3-none-any.whl (10.0 kB)
Collecting google-cloud-core<2.0.0,>=1.4.0
  Downloading google_cloud_core-1.7.1-py2.py3-none-any.whl (28 kB)
  Downloading google_cloud_core-1.7.0-py2.py3-none-any.whl (28 kB)
  Downloading google_cloud_core-1.6.0-py2.py3-none-any.whl (28 kB)
  Downloading google_cloud_core-1.5.0-py2.py3-none-any.whl (27 kB)
  Downloading google_cloud_core-1.4.4-py2.py3-none-any.whl (27 kB)
  Downloading google_cloud_core-1.4.3-py2.py3-none-any.whl (27 kB)
  Downloading google_cloud_core-1.4.2-py2.py3-none-any.whl (26 kB)
  Downloading google_cloud_core-1.4.1-py2.py3-none-any.whl (26 kB)
INFO: pip is looking at multiple versions of google-cloud-bigquery-storage to determine which version is compatible with other requirements. This could take a while.
Collecting google-cloud-bigque

Collecting cachetools<6.0,>=2.0.0
  Downloading cachetools-4.2.4-py3-none-any.whl (10 kB)
Collecting google-resumable-media<3.0dev,>=0.6.0
  Downloading google_resumable_media-1.3.3-py2.py3-none-any.whl (75 kB)
Collecting google-crc32c<2.0dev,>=1.0
  Downloading google_crc32c-1.3.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (37 kB)
Installing collected packages: cachetools, google-auth, google-crc32c, google-api-core, google-resumable-media, google-cloud-core, threadpoolctl, scipy, joblib, google-cloud-storage, google-cloud-datastore, google-cloud-bigquery-storage, google-cloud-bigquery, scikit-learn
  Attempting uninstall: cachetools
    Found existing installation: cachetools 5.0.0
    Uninstalling cachetools-5.0.0:
      Successfully uninstalled cachetools-5.0.0
  Attempting uninstall: google-auth
    Found existing installation: google-auth 2.6.0
    Uninstalling google-auth-2.6.0:
    

### Check the Feast version

In [2]:
!feast version 

Feast SDK Version: "feast 0.19.2"


## Step 2: Clone the Git repo

Clone the Driver Ranking Git repo into your Colab folder.

In [3]:
!git clone https://github.com/feast-dev/feast-driver-ranking-tutorial.git

Cloning into 'feast-driver-ranking-tutorial'...
remote: Enumerating objects: 65, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 65 (delta 26), reused 43 (delta 14), pack-reused 0
Unpacking objects: 100% (65/65), 21.29 KiB | 396.00 KiB/s, done.


## Step 3: Set up your Google Cloud Platform (GCP) configuration

### Authenticate into GCP
This lets you complete the advanced section of the tutorial, where you materialize features remotely on GCP. Feast spins up infrastructure on GCP using the credentials in your environment. Run the following cell to log in to GCP:

In [4]:
from google.colab import auth
auth.authenticate_user()

ModuleNotFoundError: No module named 'google.colab'
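
The `ModuleNotFoundError` shown here is what you get when the cell runs outside Colab, where the `google.colab` module is not available. A minimal sketch of a guarded version follows. The helper name `authenticate_if_colab` is hypothetical, and outside Colab the sketch assumes you have already configured credentials another way (for example with `gcloud auth application-default login`).

```python
def authenticate_if_colab() -> bool:
    """Run the Colab auth flow when available; return True if it ran."""
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import auth
    except ImportError:
        # Not on Colab: assume credentials were configured some other way.
        return False
    auth.authenticate_user()
    return True


if authenticate_if_colab():
    print("Authenticated through Colab.")
else:
    print("Not running on Colab; skipping Colab authentication.")
```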

### Set configurations
Set the following configuration values, which are used throughout the tutorial:

* `PROJECT_ID`: Your GCP project ID.
* `BUCKET_NAME`: The name of the bucket used to store the feature store registry and model artifacts.
* `BIGQUERY_DATASET_NAME`: The name of the dataset in which tables containing features are created.
* `AI_PLATFORM_MODEL_NAME`: The name of the model to be created in AI Platform.

In [7]:
PROJECT_ID = "kf-feast" #@param {type:"string"}
BUCKET_NAME = "driver_ranking_tutorial" #@param {type:"string"}
BIGQUERY_DATASET_NAME = "feast_driver_ranking_tutorial" #@param {type:"string"}
AI_PLATFORM_MODEL_NAME = "feast_driver_rankin_jsd_model" #@param {type:"string"}

! gcloud config set project $PROJECT_ID
%env GOOGLE_CLOUD_PROJECT=$PROJECT_ID
!echo project_id = $PROJECT_ID > ~/.bigqueryrc

Updated property [core/project].
env: GOOGLE_CLOUD_PROJECT=kf-feast


In [8]:
# Only run if your bucket doesn't already exist!
! gsutil mb gs://$BUCKET_NAME

Creating gs://driver_ranking_tutorial/...


## Step 4: Apply and deploy feature definitions

`feast apply` scans the Python files in the current directory for feature definitions and deploys infrastructure according to `feature_store.yaml`.

In [9]:
%%shell
cd /content/feast-driver-ranking-tutorial/driver_ranking/
feast apply

Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats




### Inspect the files created under your local folder

In [10]:
%%shell
cd /content/feast-driver-ranking-tutorial/driver_ranking/data/
ls -l 

total 20
-rw-r--r-- 1 root root 16384 Jul 26 20:43 online.db
-rw-r--r-- 1 root root   310 Jul 26 20:43 registry.db
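
In local mode the online store is SQLite, so Python's built-in `sqlite3` module can show what `feast apply` created in `online.db`. This is a rough sketch: the helper `list_tables` is hypothetical, the path is taken from the cell above, and the table names you see depend on your project and feature views.

```python
import os
import sqlite3
from contextlib import closing
from typing import List

# Path taken from the cell above; adjust it if your repo lives elsewhere.
DB_PATH = "/content/feast-driver-ranking-tutorial/driver_ranking/data/online.db"


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in a SQLite database file, sorted by name."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if os.path.exists(DB_PATH):
    print(list_tables(DB_PATH))
else:
    print(f"{DB_PATH} not found; run `feast apply` first.")
```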




## Step 5: Train your model

In [13]:
import feast
from joblib import dump
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load driver order data
orders = pd.read_csv("/content/feast-driver-ranking-tutorial/driver_orders.csv", sep="\t")
orders["event_timestamp"] = pd.to_datetime(orders["event_timestamp"])

# Connect to your feature store provider
fs = feast.FeatureStore(repo_path="/content/feast-driver-ranking-tutorial/driver_ranking")
        
# Retrieve training data from BigQuery
training_df = fs.get_historical_features(
    entity_df=orders,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

# Train model
target = "trip_completed"

reg = LinearRegression()
train_X = training_df[training_df.columns.drop(target).drop("event_timestamp")]
train_Y = training_df.loc[:, target]
reg.fit(train_X[sorted(train_X)], train_Y)

# Save model
dump(reg, "driver_model.bin")

----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column                                Non-Null Count  Dtype              
---  ------                                --------------  -----              
 0   event_timestamp                       10 non-null     datetime64[ns, UTC]
 1   driver_id                             10 non-null     int64              
 2   trip_completed                        10 non-null     int64              
 3   driver_hourly_stats__conv_rate        10 non-null     float64            
 4   driver_hourly_stats__acc_rate         10 non-null     float64            
 5   driver_hourly_stats__avg_daily_trips  10 non-null     int64              
dtypes: datetime64[ns, UTC](1), float64(2), int64(3)
memory usage: 608.0 bytes
None

----- Example features -----

            event_timestamp  ...  driver_hourly_stats__avg_daily_trips
0 2021-04-17 04:29:28+00:00  ...                     

['driver_model.bin']
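
The training cell indexes the frame with `sorted(train_X)`, which puts the input columns in alphabetical order by name. Applying the same rule at prediction time keeps the column order consistent even if the data arrives with its keys in a different order. Below is a minimal sketch of that idea using plain dictionaries instead of DataFrames; the helper `to_rows` and the sample values are made up for illustration.

```python
from typing import Dict, List


def to_rows(columns: Dict[str, List[float]]) -> List[List[float]]:
    """Turn a column-oriented dict into rows, with columns in sorted-name order."""
    ordered_names = sorted(columns)
    return [list(row) for row in zip(*(columns[name] for name in ordered_names))]


# Same (hypothetical) data, but with the keys in a different order.
training_like = {"conv_rate": [0.5, 0.7], "acc_rate": [0.9, 0.8], "driver_id": [1001, 1002]}
serving_like = {"driver_id": [1001, 1002], "acc_rate": [0.9, 0.8], "conv_rate": [0.5, 0.7]}

print(sorted(training_like))  # ['acc_rate', 'conv_rate', 'driver_id']
print(to_rows(training_like) == to_rows(serving_like))  # True
```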

## Step 6: Materialize your online store
Materialize the feature data into the Firestore online store.

In [14]:
!cd /content/feast-driver-ranking-tutorial/driver_ranking/ && feast materialize-incremental 2022-01-01T00:00:00

Materializing 1 feature views to 2022-01-01 00:00:00+00:00 into the datastore online store.

driver_hourly_stats from 2020-07-27 20:45:14+00:00 to 2022-01-01 00:00:00+00:00:
100%|███████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  6.16it/s]
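
`feast materialize-incremental` takes the end of the time range as a timestamp, written here as `2022-01-01T00:00:00`. If you would rather materialize up to the current moment than to a fixed date, one option is to build that string in Python. The helper name `feast_timestamp` is an assumption made for this sketch.

```python
from datetime import datetime, timezone
from typing import Optional


def feast_timestamp(moment: Optional[datetime] = None) -> str:
    """Format a timezone-aware datetime in UTC, e.g. 2022-01-01T00:00:00."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


end_date = feast_timestamp()
print(end_date)
```

In a notebook cell, the value can then be passed to the CLI with `!feast materialize-incremental $end_date`, run from the feature repo directory.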


## Step 7: Make a prediction

In [19]:
import pandas as pd
import feast
from joblib import load


class DriverRankingModel:
    def __init__(self):
        # Load model
        self.model = load("/content/driver_model.bin")

        # Set up feature store
        self.fs = feast.FeatureStore(repo_path="/content/feast-driver-ranking-tutorial/driver_ranking/")

    def predict(self, driver_ids):
        # Read features from Feast
        driver_features = self.fs.get_online_features(
            entity_rows=[{"driver_id": driver_id} for driver_id in driver_ids],
            features=[
                "driver_hourly_stats:conv_rate",
                "driver_hourly_stats:acc_rate",
                "driver_hourly_stats:avg_daily_trips",
            ],
        )
        df = pd.DataFrame.from_dict(driver_features.to_dict())

        # Make prediction
        df["prediction"] = self.model.predict(df[sorted(df)])

        # Choose best driver
        best_driver_id = df["driver_id"].iloc[df["prediction"].argmax()]

        # return best driver
        return best_driver_id

In [20]:
def make_drivers_prediction():
    drivers = [1001, 1002, 1003, 1004]
    model = DriverRankingModel()
    best_driver = model.predict(drivers)
    print(f"Prediction for best driver id: {best_driver}")

In [21]:
make_drivers_prediction()

Prediction for best driver id: 1001
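
`predict` returns only the highest-scoring driver. If you need the whole ordering, the same scores can be sorted instead of reduced with `argmax`. The sketch below uses plain lists and made-up scores rather than the model's output, and `rank_drivers` is a hypothetical helper, not part of the tutorial repo.

```python
from typing import List, Sequence


def rank_drivers(driver_ids: Sequence[int], scores: Sequence[float]) -> List[int]:
    """Return driver ids ordered from highest to lowest score."""
    if len(driver_ids) != len(scores):
        raise ValueError("driver_ids and scores must have the same length")
    paired = sorted(zip(scores, driver_ids), key=lambda pair: pair[0], reverse=True)
    return [driver_id for _, driver_id in paired]


# Made-up scores, for illustration only.
ranking = rank_drivers([1001, 1002, 1003, 1004], [0.82, 0.41, 0.77, 0.10])
print(ranking)  # [1001, 1003, 1002, 1004]
```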
