# Qwak Feature Store Example - Batch Feature Set with SQL Transformation

Welcome to the Qwak Feature Store example! In this tutorial, we'll guide you through creating a sample Data Source, transforming it into a Feature Set, and leveraging its features for model training and inference using the Qwak Platform. 

Guides like this one aim to provide you with a starting point by offering a straightforward framework for working with Qwak. However, we encourage you to explore our [comprehensive documentation](https://docs-saas.qwak.com/docs/feature-store-overview) for more detailed and specific information.

Before diving in, make sure you have the Qwak SDK installed and authenticated. If you haven't done so already, follow these steps:

1. [Install the Qwak SDK](https://docs-saas.qwak.com/docs/installing-the-qwak-sdk) - Ensure you have the SDK installed on your local environment.
2. [Authenticate](https://docs-saas.qwak.com/docs/installing-the-qwak-sdk#1-via-qwak-cli) - Authenticate with a new Personal or Service Qwak API Key.

To gain a deeper understanding of Feature Stores and their importance in machine learning workflows, we recommend checking out our comprehensive [documentation](https://docs-saas.qwak.com/docs/feature-store-overview) and our blog article on [What is a Feature Store](https://www.qwak.com/post/what-is-a-feature-store-in-ml). Let's get started!


## Create the S3-based Data Source

In Qwak, a Data Source serves as a configuration object that specifies how to access and fetch your data. It includes metadata such as name and description, connection details to the data store/storage, the query or resource to retrieve, and the relevant time column for indexing time series features.

### Components of a Data Source:

1. **Metadata**: Includes information like name, description, etc.
2. **URL and Connection Details**: Specifies the connection details to the data store/storage.
3. **Query or Resource**: Defines the resource (file, table, view) to retrieve data from.
4. **Time Column**: Specifies the relevant time column for indexing time series features.

In the following example, we'll connect to a publicly accessible S3 bucket and fetch data from a single CSV file, for simplicity.


In [3]:
%%writefile data_source.py

from qwak.feature_store.data_sources import CsvSource
import pandas as pd

# The S3 anonymous config class is required for public S3 buckets
from qwak.feature_store.data_sources import AnonymousS3Configuration

# Create a CsvSource object to represent a CSV data source 
# This example uses a CSV file from a public S3 bucket
csv_source = CsvSource(
    name='credit_risk_data',                                    # Name of the data source
    description='A dataset of personal credit details',         # Description of the data source
    date_created_column='date_created',                         # Column name that represents the creation date
    path='s3://qwak-public/example_data/data_credit_risk.csv',  # S3 path to the CSV file 
    filesystem_configuration=AnonymousS3Configuration(),        # Configuration for anonymous access to S3
    quote_character='"',                                        # Character used for quoting in the CSV file
    escape_character='"'                                        # Character used for escaping in the CSV file
)

Writing data_source.py


### Additional Considerations for Registering Data Sources

When registering Data Sources in Qwak, it's essential to ensure that the underlying data store is accessible by the platform. Depending on your deployment model (Hybrid or SaaS), there are different ways to grant Qwak access to your data.

#### Accessing AWS Resources:

If your data is stored in AWS services, you can grant access to Qwak using an IAM role ARN. For detailed instructions, refer to our documentation on [Accessing AWS Resources with IAM Role](https://docs-saas.qwak.com/docs/accessing-aws-resources-with-iam-role).

#### Using Qwak Secrets:

Alternatively, you can pass the credentials as Qwak Secrets. This approach provides a secure way to manage and authenticate access to your data. For more information, see [Qwak Secret Management](https://docs-saas.qwak.com/docs/secret-management).

For more information about the types of Data Sources supported by Qwak, refer to our documentation:
- [Batch Data Sources](https://docs-saas.qwak.com/docs/batch-data-sources)
- [Streaming Data Sources](https://docs-saas.qwak.com/docs/streaming-data-sources)

<br>

### Sampling Data from the Data Source

It's important to note that the data source cannot be used as a query engine independently (for now). Instead, it serves as a sampling mechanism to verify that the data is being queried properly.


In [2]:
%run data_source.py

df_sample = csv_source.get_sample()
print(f"Data Source Data Types:\n\n{df_sample.dtypes}\n")
print(f"Data Source Sample :\n\n{df_sample.head(7).to_string()}\n")

Data Source Data Types:

age                  int64
sex                 object
job                  int64
housing             object
saving_account      object
checking_account    object
credit_amount        int64
duration             int64
purpose             object
risk                object
user_id             object
date_created         int64
dtype: object

Data Source Sample :

   age     sex  job housing saving_account checking_account  credit_amount  duration              purpose  risk                               user_id   date_created
0   67    male    2     own           None           little           1169         6             radio/TV  good  baf1aed9-b16a-46f1-803b-e2b08c8b47de  1609459200000
1   22  female    2     own         little         moderate           5951        48             radio/TV   bad  574a2cb7-f3ae-48e7-bd32-b44015bf9dd4  1609459200000
2   49    male    1     own         little             None           2096        12            education  good  1b044d
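
Note that the sample reports `date_created` as `int64`. The values look like Unix epoch timestamps in milliseconds; this is an assumption based on their magnitude, although it is consistent with the `timestamp` type reported for this column during registration. A minimal sketch for decoding one value by hand, using only the standard library (the helper name `epoch_millis_to_utc` is ours, not part of the Qwak SDK):

```python
from datetime import datetime, timezone


def epoch_millis_to_utc(millis: int) -> datetime:
    """Convert a Unix epoch timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


# Value copied from the sample output above
created = epoch_millis_to_utc(1609459200000)
print(created.isoformat())  # 2021-01-01T00:00:00+00:00
```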

## Registering the Data Source with the Qwak Platform

After verifying that the Data Source returns the desired results, the next step is to register it with the Qwak Platform.

In [2]:
!echo "Y" | qwak features register -p data_source.py

Notice that BatchInferenceClient and FeedbackClient are not available in the skinny package. In order to use them, please install them as extras: pip install "qwak-inference[batch,feedback]".
✅ Finding Entities to register (0:00:00.00)
👀 Found 0 Entities
----------------------------------------
✅ Finding Data Sources to register (0:00:00.00)
👀 Found 1 Data Sources
Validating 'credit_risk_data' data source
✅ (0:00:06.22)
✅ Validation completed successfully, got data source columns:
column name       type
----------------  ---------
age               int
sex               string
job               int
housing           string
saving_account    string
checking_account  string
credit_amount     int
duration          int
purpose           string
risk              string
user_id           string
date_created      timestamp
Update existing Data Source 'credit_risk_data' from source file '/home/qwak/workspace/data_source.py'?
continue? [y/N

<hr><br>

## Creating the Batch Feature Set from the Data Source

A Feature Set typically consists of the following components:

- **Metadata:** Includes name, key, data sources, and the timestamp column used for indexing.
- **Scheduling Expression:** For Batch Feature Sets, this defines when ingestion jobs should run.
- **Cluster Type:** Specifies the resources to use for running the ingestion job.
- **Backfill:** Determines how far back in time the Feature Set should ingest data.
- **Transformation:** Can be SQL-based or UDF-based (currently Koalas) for data transformation.

[Read Policies](https://docs-saas.qwak.com/docs/read-policies) instruct Qwak on which data to fetch from the Data Source. 
- **NewOnly:** Fetches records created after the last ingestion.
- **TimeFrame:** Fetches records within a specified timeframe.
- **FullRead:** Fetches all data from the Data Source in every ingestion job, which can be heavy for main tables but useful for foreign key-based tables.

For this example, we'll use NewOnly since our sample Data Source is static, consisting of a single CSV file.
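
To make the difference between the three read policies concrete, here is a small in-memory sketch. It only illustrates the descriptions above and is not how Qwak implements them: the `select_records` helper, the string policy names, the boundary handling, and the first-run behavior of `NewOnly` are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

Record = Dict[str, object]


def select_records(
    records: List[Record],
    policy: str,
    now: datetime,
    last_ingestion: Optional[datetime] = None,
    window: Optional[timedelta] = None,
) -> List[Record]:
    """Hypothetical helper that mimics the three read policies on in-memory rows."""
    if policy == "NewOnly":
        # Only rows created after the previous ingestion
        # (assumption: everything is read on the first run)
        if last_ingestion is None:
            return list(records)
        return [r for r in records if r["date_created"] > last_ingestion]
    if policy == "TimeFrame":
        # Only rows inside a trailing time window that ends at `now`
        if window is None:
            raise ValueError("TimeFrame requires a window")
        return [r for r in records if now - window <= r["date_created"] <= now]
    if policy == "FullRead":
        # Every row, on every ingestion job
        return list(records)
    raise ValueError(f"Unknown policy: {policy}")


rows = [
    {"user_id": "a", "date_created": datetime(2021, 1, 1)},
    {"user_id": "b", "date_created": datetime(2021, 1, 5)},
    {"user_id": "c", "date_created": datetime(2021, 1, 9)},
]
now = datetime(2021, 1, 10)

new_only = select_records(rows, "NewOnly", now, last_ingestion=datetime(2021, 1, 4))
time_frame = select_records(rows, "TimeFrame", now, window=timedelta(days=2))
full_read = select_records(rows, "FullRead", now)

print([r["user_id"] for r in new_only])    # ['b', 'c']
print([r["user_id"] for r in time_frame])  # ['c']
print([r["user_id"] for r in full_read])   # ['a', 'b', 'c']
```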

The execution specification refers to the size of the cluster used for data ingestion. More information can be found in the [Qwak docs](https://docs-saas.qwak.com/docs/instance-sizes#feature-store).


In [1]:
%%writefile batch_feature_set_sql.py

from datetime import datetime
from qwak.feature_store.feature_sets import batch
from qwak.feature_store.feature_sets.transformations import SparkSqlTransformation
from qwak.feature_store.feature_sets.execution_spec import ClusterTemplate
from qwak.feature_store.feature_sets.read_policies import ReadPolicy

@batch.feature_set(
    name="credit-risk-fs-sql", # must contain dashes -, NOT underscores _
    key="user",
    data_sources={"credit_risk_data": ReadPolicy.NewOnly},
    timestamp_column_name="date_created"  # Must be included in transformation output
)
@batch.scheduling(cron_expression="0 0 * * *")
@batch.execution_specification(cluster_template=ClusterTemplate.MEDIUM)
@batch.backfill(start_date=datetime(2019, 12, 31, 0, 0, 0))
def transform():
    return SparkSqlTransformation(sql="""
        SELECT user_id as user,
               age,
               sex,
               job,
               housing,
               saving_account,
               checking_account,
               credit_amount,
               duration,
               purpose,
               date_created
        FROM credit_risk_data
    """)


Overwriting batch_feature_set_sql.py
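
The `cron_expression="0 0 * * *"` used in the feature set above follows the standard five-field cron layout (minute, hour, day of month, month, day of week), so it schedules one ingestion job per day at 00:00. The timezone in which the schedule is evaluated is not stated here. A minimal sketch that names the fields, using a hypothetical `describe_cron` helper that is not part of the Qwak SDK:

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict:
    """Split a standard five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"Expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


fields = describe_cron("0 0 * * *")
print(fields)
# {'minute': '0', 'hour': '0', 'day_of_month': '*', 'month': '*', 'day_of_week': '*'}
```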


## Sampling the Feature Set and Printing Data and Data Types

If your data source takes more than 5 minutes to query or fetch a sample of the data (for example, due to long-running queries), your sampling process may fail with a timeout error. In such cases, you can skip validation during registration with Qwak and proceed to register your feature set, allowing it to run an ingestion job.

### Note:
The sampling process is essential for verifying that the data is queried properly. However, if it takes too long, you can proceed with the registration without validation and rely on the ingestion job to ensure data correctness.


In [2]:
%run batch_feature_set_sql.py

df_sample = transform.get_sample()
print(f"Data Source Data Types:\n\n{df_sample.dtypes}\n")
print(f"Data Source Sample :\n\n{df_sample}\n")

Data Source Data Types:

user                object
age                  int64
sex                 object
job                  int64
housing             object
saving_account      object
checking_account    object
credit_amount        int64
duration             int64
purpose             object
date_created         int64
dtype: object

Data Source Sample :

                                   user  age     sex  job housing  \
0  baf1aed9-b16a-46f1-803b-e2b08c8b47de   67    male    2     own   
1  574a2cb7-f3ae-48e7-bd32-b44015bf9dd4   22  female    2     own   
2  1b044db3-3bd1-4b71-a4e9-336210d6503f   49    male    1     own   
3  ac8ec869-1a05-4df9-9805-7866ca42b31c   45    male    2    free   
4  aa974eeb-ed0e-450b-90d0-4fe4592081c1   53    male    2    free   
5  7b3d019c-82a7-42d9-beb8-2c57a246ff16   35    male    1    free   
6  6bc1fd70-897e-49f4-ae25-960d490cb74e   53    male    2     own   
7  193158eb-5552-4ce5-92a4-2a966895bec5   35    male    3    rent   
8  759b5b46-dbe9-40e

## Visualizing Data in the Feature Store

The displayed data represents the features stored in the feature store, which will be utilized in our Qwak ML model for both training and inference purposes.

Once we have confirmed that the data appears as expected and meets our requirements, we can proceed with registering the feature set in Qwak.


In [3]:
!echo "Y" | qwak features register -p batch_feature_set_sql.py

Notice that BatchInferenceClient and FeedbackClient are not available in the skinny package. In order to use them, please install them as extras: pip install "qwak-inference[batch,feedback]".
✅ Finding Entities to register (0:00:00.07)
👀 Found 0 Entities
----------------------------------------
✅ Finding Data Sources to register (0:00:00.01)
👀 Found 0 Data Sources
----------------------------------------
✅ Finding Feature Sets to register (0:00:00.01)
👀 Found 1 Feature Set(s)
Create new feature set 'credit-risk-fs-sql' from source file '/home/qwak/workspace/batch_feature_set_sql.py'?
continue? [y/N]: Validating 'credit-risk-fs-sql' feature set
✅ (0:00:06.20)
✅ Validation completed successfully, got data source columns:
column name       type
----------------  ---------
user              string
age               int
sex               string
job               int
housing           string
saving_account    string
che

<br>

#### Verifying Feature Set Registration

To ensure that the Feature Set has been successfully registered and is valid, execute the following command to list all Feature Sets associated with your Qwak account:

<br>

In [None]:
!qwak features list

<br>

For more information on the available Feature Store SDK commands, please use the CLI help:

<br>

In [15]:
!qwak features --help

Notice that BatchInferenceClient and FeedbackClient are not available in the skinny package. In order to use them, please install them as extras: pip install "qwak-inference[batch,feedback]".
Usage: qwak features [OPTIONS] COMMAND [ARGS]...

  Commands for interacting with the Qwak Feature Store

Options:
  --help  Show this message and exit.

Commands:
  backfill          Trigger a backfill process for a Feature Set
  delete            Delete by name a feature store object - a feature...
  execution-status  Retrieve the current status of an execution...
  list              List registered feature sets
  pause             Pause a running feature set
  register          Register and deploy all feature store object under...
  resume            Resume a paused feature set
  trigger           Trigger a batch feature set job ingestion job


<hr><br>

## Consuming Features from the Offline Feature Store (Training/Batch Inference)

To retrieve features from the Offline Feature Store for training or batch inference, you can use two methods:

1. **By List of IDs and Timestamp**:
   - Fetches records associated with the provided set of keys, inserted at a specific timestamp.
   - Query date must fall between the start and end timestamp.

2. **By Date Range**:
   - Retrieves all records within the specified date range.
   - May include multiple records per key for time series data.


For simplicity we will focus on the second option, but you can find more information on the first one in [our docs](https://docs-saas.qwak.com/docs/getting-features-for-training#get-feature-values). 

In [1]:
# Importing the Feature Store clients used to fetch results
from qwak.feature_store.offline import OfflineClientV2
from qwak.feature_store.offline.feature_set_features import FeatureSetFeatures

from datetime import datetime
import pandas as pd

def fetch_training_features(start_time: datetime, end_time: datetime) -> pd.DataFrame: 

    offline_feature_store = OfflineClientV2()
    
    features = FeatureSetFeatures(
        feature_set_name='credit-risk-fs-sql',
        feature_names=['age', 'sex', 'job', 'housing', 'saving_account', 'checking_account', 'credit_amount', 'duration', 'purpose']
    )
    
    # It's recommended to wrap this call in a try/except block
    features: pd.DataFrame = offline_feature_store.get_feature_range_values(
        features=features,
        start_date=start_time,
        end_date=end_time
    )

    return features
    

if __name__ == '__main__':

    # Define the date range for feature retrieval
    feature_range_start = datetime(year=2023, month=1, day=1)
    feature_range_end = datetime.today()

    train_df = fetch_training_features(feature_range_start, feature_range_end)

    print(f"\n\nTraining data sample:\n\n{train_df.head().to_string()}\n")



Training data sample:

                                   user                    date_created  credit-risk-fs-sql.age credit-risk-fs-sql.sex  credit-risk-fs-sql.job credit-risk-fs-sql.housing credit-risk-fs-sql.saving_account credit-risk-fs-sql.checking_account  credit-risk-fs-sql.credit_amount  credit-risk-fs-sql.duration credit-risk-fs-sql.purpose
0  45b7836f-bf7c-4039-bc9e-d33982cc1fc5  2021-01-01 00:00:00.000000 UTC                      27                   male                       2                        own                          moderate                            moderate                              4576                           45                        car
1  45b7836f-bf7c-4039-bc9e-d33982cc1fc5  2023-03-20 23:00:00.000000 UTC                      27                   male                       2                        own                          moderate                            moderate                              4576                           45             
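
As the output above shows, the returned feature columns are prefixed with the feature set name (for example `credit-risk-fs-sql.age`), while the key and timestamp columns are not. If your training code expects bare feature names, one option is to build a rename mapping first. The sketch below uses a hypothetical helper, `strip_feature_set_prefix`, that works on plain column names and is not part of the Qwak SDK:

```python
from typing import Dict, List


def strip_feature_set_prefix(columns: List[str], feature_set: str) -> Dict[str, str]:
    """Map 'feature-set.feature' column names to bare feature names.

    Columns without the prefix (e.g. the key and timestamp) are left out of the mapping.
    """
    prefix = f"{feature_set}."
    return {c: c[len(prefix):] for c in columns if c.startswith(prefix)}


columns = ["user", "date_created", "credit-risk-fs-sql.age", "credit-risk-fs-sql.sex"]
mapping = strip_feature_set_prefix(columns, "credit-risk-fs-sql")
print(mapping)
# {'credit-risk-fs-sql.age': 'age', 'credit-risk-fs-sql.sex': 'sex'}
```

With a pandas DataFrame, the mapping can then be applied with `train_df.rename(columns=mapping)`.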

<br>

Please note that although the Feature Set has been registered, it usually takes a couple of minutes to run the first ingestion job. This means you might not have any data to fetch until the ingestion job runs at least once.

To verify the status of the ingestion, please refer to the Qwak Dashboard -> Feature Sets -> `credit-risk-fs-sql` -> Jobs.

![Feature Store Dashboard](PNGs/ingestion-job-finished.png)


<br>

<hr><br>

## Consuming Features for Real-Time Inference from the Online Store

In the previous example, we retrieved historical data from the Offline Store, which stores all historical data. Now, we'll use the Online Store, which is optimized for real-time use cases and provides a low-latency feature retrieval mechanism.
Qwak provides two ways to query the Online store and look up the most recent feature vector for a given key:

### 1. Enriching Inference Requests with Features from the Online Store

Qwak natively integrates the model runtime with the Feature Store, which gives you very low-latency feature retrieval without writing an explicit query. You only send the feature set key in the model request input; during the serving request, Qwak automatically extracts the latest features for that key (`user` in our case).


Note: The code below is for local testing only. Before building a live model, remove the `run_local` import.

**The ModelSchema definition is mandatory to enable feature extraction via the OnlineClient or qwak.api decorator**.


In [2]:
from qwak.model.tools import run_local # utility tooling for local testing and debugging - REMOVE BEFORE BUILDING REMOTELY

from qwak.model.base import QwakModel
from qwak.model.adapters import DefaultOutputAdapter, DataFrameInputAdapter
from qwak.model.schema import ModelSchema, InferenceOutput
from qwak.model.schema_entities import FeatureStoreInput
import pandas as pd
import qwak

FEATURE_SET = 'credit-risk-fs-sql'

class CreditRiskModel(QwakModel):
   
    def __init__(self):
        pass

    def build(self):
        pass

    def schema(self) -> ModelSchema:
        model_schema = ModelSchema(
            inputs=[
                FeatureStoreInput(name=f'{FEATURE_SET}.checking_account'),
                FeatureStoreInput(name=f'{FEATURE_SET}.age'),
                FeatureStoreInput(name=f'{FEATURE_SET}.job'),
                FeatureStoreInput(name=f'{FEATURE_SET}.duration'),
                FeatureStoreInput(name=f'{FEATURE_SET}.credit_amount'),
                FeatureStoreInput(name=f'{FEATURE_SET}.housing'),
                FeatureStoreInput(name=f'{FEATURE_SET}.purpose'),
                FeatureStoreInput(name=f'{FEATURE_SET}.saving_account'),
                FeatureStoreInput(name=f'{FEATURE_SET}.sex'),
            ],
            outputs=[InferenceOutput(name="credit_score", type=float)]
        )
        return model_schema

    @qwak.api(
        feature_extraction=True,
        input_adapter=DataFrameInputAdapter(),
        output_adapter=DefaultOutputAdapter()
    )
    def predict(self, df: pd.DataFrame, extracted_df: pd.DataFrame) -> pd.DataFrame:
        print(f"\nInput dataframe df:\n{df}")
        print(f"\nFeature Set extracted dataframe:\n{extracted_df.to_string()}")
        return pd.DataFrame([['score', 0.5]])


Calling the model locally to test `predict()`:

In [7]:
def test_model_locally():
    # Create a new instance of the model
    m = CreditRiskModel()

    # Define the columns
    columns = ["user"]

    # Define the data
    data = [["45b7836f-bf7c-4039-bc9e-d33982cc1fc5"]]

    
    # Create the DataFrame and convert it to JSON
    json_payload = pd.DataFrame(data, columns=columns).to_json()
    print("Predicting for: \n\n", json_payload)
    

    # Run local inference using the model and print the prediction
    # The run_local function is part of the qwak library and allows for local testing of the model
    prediction = run_local(m, json_payload)
    print("\nPrediction: ", prediction)

test_model_locally()

Predicting for: 

 {"user":{"0":"45b7836f-bf7c-4039-bc9e-d33982cc1fc5"}}

Input dataframe df:
                                   user
0  45b7836f-bf7c-4039-bc9e-d33982cc1fc5

Feature Set extracted dataframe:
                                   user credit-risk-fs-sql.checking_account  credit-risk-fs-sql.age  credit-risk-fs-sql.job  credit-risk-fs-sql.duration  credit-risk-fs-sql.credit_amount credit-risk-fs-sql.housing credit-risk-fs-sql.purpose credit-risk-fs-sql.saving_account credit-risk-fs-sql.sex
0  45b7836f-bf7c-4039-bc9e-d33982cc1fc5                            moderate                      27                       2                           45                              4576                        own                        car                          moderate                   male

Prediction:  [{"0":"score","1":0.5}]
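
The request payload printed above is the column-oriented JSON that `DataFrame.to_json()` produces by default: each column name maps to an `{index: value}` object. If the calling service doesn't use pandas, the same shape can be built with the standard library. This is a sketch based only on the local run shown here; `build_user_payload` is a hypothetical helper, and whether a deployed endpoint accepts this exact shape depends on the input adapter you configure.

```python
import json
from typing import List


def build_user_payload(user_ids: List[str]) -> str:
    """Build a column-oriented JSON payload: {"user": {"0": id0, "1": id1, ...}}."""
    column = {str(i): user_id for i, user_id in enumerate(user_ids)}
    # Compact separators reproduce the spacing of the payload printed above
    return json.dumps({"user": column}, separators=(",", ":"))


payload = build_user_payload(["45b7836f-bf7c-4039-bc9e-d33982cc1fc5"])
print(payload)
# {"user":{"0":"45b7836f-bf7c-4039-bc9e-d33982cc1fc5"}}
```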


<br>
As you can see, we only sent the `user` ID in the prediction request, and Qwak automatically extracted the relevant (latest) features for that key from the Feature Set specified in the Model Schema.

This approach also automatically logs the extraction latency to the model Analytics.
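
Conceptually, this enrichment behaves like a keyed lookup against a "latest value per key" view of the feature history. The sketch below is a plain-Python illustration of that idea, not Qwak's implementation; the `History` structure, the helper names, and the sample values are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory feature history: (key, timestamp, features)
History = List[Tuple[str, int, Dict[str, object]]]


def latest_per_key(history: History) -> Dict[str, Dict[str, object]]:
    """Build an 'online' view that keeps only the most recent feature vector per key."""
    latest: Dict[str, Tuple[int, Dict[str, object]]] = {}
    for key, ts, features in history:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, features)
    return {key: features for key, (_, features) in latest.items()}


def enrich(
    request: Dict[str, object],
    online_view: Dict[str, Dict[str, object]],
    key_name: str,
) -> Dict[str, object]:
    """Return the request merged with the latest features for its key, if any exist."""
    features = online_view.get(str(request[key_name]), {})
    return {**request, **features}


history = [
    ("u1", 1, {"age": 26, "duration": 12}),
    ("u1", 2, {"age": 27, "duration": 45}),
    ("u2", 1, {"age": 31, "duration": 24}),
]
online_view = latest_per_key(history)
print(enrich({"user": "u1"}, online_view, key_name="user"))
# {'user': 'u1', 'age': 27, 'duration': 45}
```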

<br>

### 2. Features Lookup with the OnlineClient

The previous approach lets a QwakModel fetch features automatically, which works well for most cases. Sometimes, however, you need more control over which keys are looked up at runtime, for example when a single prediction request requires looking up multiple keys.

That's what the `OnlineClient` is for: it lets you query the Online Store explicitly, as shown below:

<br>

In [9]:
import pandas as pd
from qwak.feature_store.online.client import OnlineClient
from qwak.model.schema_entities import FeatureStoreInput
from qwak.model.schema import ModelSchema, InferenceOutput

FEATURE_SET = 'credit-risk-fs-sql'


model_schema = ModelSchema(
            inputs=[
                FeatureStoreInput(name=f'{FEATURE_SET}.checking_account'),
                FeatureStoreInput(name=f'{FEATURE_SET}.age'),
                FeatureStoreInput(name=f'{FEATURE_SET}.job'),
                FeatureStoreInput(name=f'{FEATURE_SET}.duration'),
                FeatureStoreInput(name=f'{FEATURE_SET}.credit_amount'),
                FeatureStoreInput(name=f'{FEATURE_SET}.housing'),
                FeatureStoreInput(name=f'{FEATURE_SET}.purpose'),
                FeatureStoreInput(name=f'{FEATURE_SET}.saving_account'),
                FeatureStoreInput(name=f'{FEATURE_SET}.sex'),
            ],
            outputs=[InferenceOutput(name="credit_score", type=float)]
        )
    
online_client = OnlineClient()

df = pd.DataFrame(columns=['user',],
                  data   =[['06cc255a-aa07-4ec9-ac69-b896ccf05322'],
                           ['46ad9e4b-1d0f-47b7-a73d-71cc66538b03'],
                           ['95ec0c53-4e27-4490-b85f-1448de70fc26']])
                  
online_features = online_client.get_feature_values(model_schema, df)


print(f"\n\nRealtime features extracted:\n\n{online_features.to_string()}\n")


Realtime features extracted:

                                   user credit-risk-fs-sql.checking_account  credit-risk-fs-sql.age  credit-risk-fs-sql.job  credit-risk-fs-sql.duration  credit-risk-fs-sql.credit_amount credit-risk-fs-sql.housing credit-risk-fs-sql.purpose credit-risk-fs-sql.saving_account credit-risk-fs-sql.sex
0  06cc255a-aa07-4ec9-ac69-b896ccf05322                            moderate                      31                       2                           24                              1935                        own                   business                            little                   male
1  46ad9e4b-1d0f-47b7-a73d-71cc66538b03                            moderate                      23                       0                            6                             14555                        own                        car                              null                   male
2  95ec0c53-4e27-4490-b85f-1448de70fc26                            moderat

<br>

You may have noticed that the FeatureStoreInput names contain both the feature set name and the feature name. This design allows you to specify and utilize multiple feature sets within the same request.

Similar to the previous option, the `ModelSchema` is a required component. It informs Qwak about the features to include in the lookup.
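
To illustrate the `<feature-set>.<feature>` naming described above, the sketch below groups qualified names by feature set, which is roughly the information a lookup needs in order to know which feature sets to query. The `group_by_feature_set` helper and the second feature set name are hypothetical and are not part of this tutorial or the Qwak SDK.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_feature_set(qualified_names: List[str]) -> Dict[str, List[str]]:
    """Group 'feature-set.feature' names by their feature set."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name in qualified_names:
        # Split on the first dot only
        feature_set, _, feature = name.partition(".")
        if not feature:
            raise ValueError(f"Expected '<feature-set>.<feature>', got {name!r}")
        grouped[feature_set].append(feature)
    return dict(grouped)


names = [
    "credit-risk-fs-sql.age",
    "credit-risk-fs-sql.job",
    "user-activity-fs.logins_7d",  # hypothetical second feature set
]
print(group_by_feature_set(names))
# {'credit-risk-fs-sql': ['age', 'job'], 'user-activity-fs': ['logins_7d']}
```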
