# Tecton Virtual Hands-On Lab
![lab1](/files/tables/lab1.jpg)

In [None]:
# import tecton and other libraries
import os
import tecton
import pandas as pd
from datetime import datetime, timedelta

tecton.set_credentials(tecton_api_key=dbutils.secrets.get(scope='tecton-lab-2', key='TECTON_API_KEY'),tecton_url="https://lab.tecton.ai/api")
ws = tecton.get_workspace('prod')

## Data Sources

First, we will connect Tecton to a batch data source. Once a data source is defined in Tecton, it can be used to build features. The code below allows Tecton to read example transaction data. It has already been added to your workspace.

```
from tecton import BatchSource, FileConfig
from datetime import datetime, timedelta

transactions = BatchSource(
    name="transactions",
    batch_config=FileConfig(
        uri="s3://tecton.ai.public/tutorials/fraud_demo/transactions/data.pq",
        file_format="parquet",
        timestamp_field="timestamp",
    ),
)
```

Use the example above to create a `customers` BatchSource with:
* The **name** as ***customers***
* **uri** being ***s3://tecton.ai.public/tutorials/fraud_demo/customers/data.pq***
* The same **file_format** as above
* The **timestamp_field** of ***signup_timestamp***

In [None]:

## add import statements here


## add BatchSource definition here
customers = BatchSource()

## Feature Views
With some data sources defined, we can build feature views on top of them. Tecton uses feature views to define how, when, and where it materializes features. A feature view can be a Batch Feature View, a Stream Feature View, or an On-Demand Feature View. Below is an example of an incomplete Batch Feature View.

```
from tecton import Entity, BatchSource, FileConfig, batch_feature_view, FilteredSource
from datetime import datetime, timedelta

user = Entity(name="user", join_keys=["user_id"])

@batch_feature_view(
    sources=[FilteredSource(customers)],
    entities=[user],
    mode="spark_sql",
    batch_schedule=timedelta(days=1),
    ttl=timedelta(days=3650),
)
def user_credit_card_issuer(customers):
    return f"""
        SELECT
            user_id,
            signup_timestamp,
            CASE SUBSTRING(CAST(cc_num AS STRING), 0, 1)
                WHEN '4' THEN 'Visa'
                ELSE 'other'
            END as credit_card_issuer
        FROM
            {customers}
        """
```

In the empty cell below, redefine the feature view so that a first digit of **'5'** maps to ***'MasterCard'*** and a first digit of **'6'** maps to ***'Discover'***.

Afterwards, running the validate() cell will check that Tecton can reach the specified data source and that it has the proper schema.
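
To sanity-check the expected labels before editing the SQL, the mapping can be sketched in plain Python. This is only an illustration of the logic the `CASE` expression is meant to encode; the function name `credit_card_issuer` and the sample card numbers are hypothetical and not part of the lab data.

```python
def credit_card_issuer(cc_num):
    """Map a card number to an issuer label using its first digit."""
    # Plays the same role as SUBSTRING(CAST(cc_num AS STRING), 0, 1) in the SQL
    first_digit = str(cc_num)[:1]
    issuers = {"4": "Visa", "5": "MasterCard", "6": "Discover"}
    return issuers.get(first_digit, "other")


# Hypothetical card numbers, one per branch of the mapping
for number in (4111111111111111, 5500000000000004, 6011000000000004, 340000000000009):
    print(number, credit_card_issuer(number))
```

Any first digit that is not listed falls through to `'other'`, just like the `ELSE` branch in the feature view.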

In [None]:
#define feature view with new definition here

In [None]:
user_credit_card_issuer.validate()

## Feature Views with Aggregations
Aggregations simplify the implementation of common, powerful features. For this feature view, we will compute the average transaction amount and the number of transactions for each user over several time windows. We will use just two aggregation functions, but Tecton provides [many different aggregations](https://docs.tecton.ai/docs/sdk-reference/time-window-aggregation-functions#docusaurus_skipToContent_fallback).

```
from tecton import Entity, BatchSource, FileConfig, batch_feature_view, Aggregation
from datetime import datetime, timedelta

transactions = ws.get_data_source("transactions")


@batch_feature_view(
    description="User transaction metrics over 1, 3 and 7 days",
    sources=[transactions],
    entities=[user],
    mode="spark_sql",
    aggregation_interval=timedelta(days=10),
    aggregations=[
        Aggregation(function="mean", column="amt", time_window=timedelta(days=1)),
        Aggregation(function="mean", column="amt", time_window=timedelta(days=30)),
        Aggregation(function="mean", column="amt", time_window=timedelta(days=365)),
        Aggregation(function="max", column="transaction", time_window=timedelta(days=1)),
        Aggregation(function="max", column="transaction", time_window=timedelta(days=30)),
        Aggregation(function="max", column="transaction", time_window=timedelta(days=365)),
    ],
    online=True,
    offline=True,
    feature_start_time=datetime(2023, 1, 1),
    batch_schedule=timedelta(days=1),
)
def user_transaction_metrics_1(transactions):
    return f"""
        SELECT user_id, timestamp, amt, 1 as transaction
        FROM {transactions}
        """
```

In the empty cell below, create a batch_feature_view with the following changes:

* Change the **aggregation_interval** from *10 days* to ***days=1***.
  * This will match our batch_schedule
* Change the **1 day**, **30 days** and **365 days** time windows to ***days=1***, ***days=3***, ***days=7***
* Change the aggregations using **max** to use ***count***
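
As a rough mental model of what the updated aggregations compute, the sketch below calculates a trailing mean and count for one user in plain Python. It is a simplification under stated assumptions: the sample transactions and the helper `window_metrics` are hypothetical, and it ignores how Tecton actually tiles data by `aggregation_interval`.

```python
from datetime import datetime, timedelta

# Hypothetical transactions for one user: (timestamp, amt)
transactions = [
    (datetime(2023, 1, 1, 9, 0), 20.0),
    (datetime(2023, 1, 5, 12, 0), 50.0),
    (datetime(2023, 1, 6, 18, 30), 80.0),
    (datetime(2023, 1, 7, 8, 15), 10.0),
]


def window_metrics(rows, as_of, window):
    """Return (mean amt, count) for rows with as_of - window <= timestamp < as_of."""
    amounts = [amt for ts, amt in rows if as_of - window <= ts < as_of]
    mean = sum(amounts) / len(amounts) if amounts else None
    return mean, len(amounts)


as_of = datetime(2023, 1, 8)
for days in (1, 3, 7):
    mean, count = window_metrics(transactions, as_of, timedelta(days=days))
    print(f"{days}d window: mean={mean}, count={count}")
```

The mean is `None` when a window contains no transactions, which is why the on-demand logic later in this lab guards against a missing average.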

In [None]:
#define feature with updated aggregations here

In [None]:
user_transaction_metrics_1.validate()

## On-Demand Feature Views
An On-Demand Feature View is used to run row-level, request-time transformations on data from Request Sources, Batch Feature Views, or Stream Feature Views. Unlike Batch and Stream Feature Views, On-Demand Feature Views do not precompute and materialize data to the Feature Store, but instead run transformations both online and offline at the time of the request.

We can build on top of the feature view we just made to compare historical values to requests happening at transaction time. Run the cell below, which defines the input and output Tecton should expect at transaction time.

In [None]:
from tecton import RequestSource
from tecton.types import Float64, Field, Bool

request_schema = [Field('amt', Float64)]
transaction_request = RequestSource(schema=request_schema)
output_schema = [Field('transaction_amount_is_higher_than_7d_average', Bool)]

We can use this new source to create an on_demand_feature_view below.

The @on_demand_feature_view
* will take in 2 **sources** as a list - the incoming **transaction_request** and the previously built feature **user_transaction_metrics_1**.
* The **mode** will be ***'python'***
* The **schema** will be the **output_schema** that was just defined.

We will define whether the incoming transaction amount is higher than the current 7-day average for a particular user as follows:
```{python}
amount_mean = 0 if user_transaction_metrics_1['amt_mean_7d_1d'] is None else user_transaction_metrics_1['amt_mean_7d_1d']
return {'transaction_amount_is_higher_than_7d_average': transaction_request['amt'] > amount_mean}
```

Use this as the feature definition and validate the on_demand_feature_view.
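
Before wiring the logic into Tecton, it can help to try it as an ordinary function. The sketch below uses plain dictionaries as stand-ins for the two inputs; the function name `is_higher_than_7d_average` and the sample values are hypothetical, and no Tecton objects are involved.

```python
def is_higher_than_7d_average(transaction_request, user_transaction_metrics_1):
    """Plain-Python version of the comparison, with dicts standing in for Tecton inputs."""
    amount_mean = user_transaction_metrics_1["amt_mean_7d_1d"]
    if amount_mean is None:  # no 7-day average available for this user
        amount_mean = 0
    return {
        "transaction_amount_is_higher_than_7d_average": transaction_request["amt"] > amount_mean
    }


# Hypothetical request amounts and 7-day averages
print(is_higher_than_7d_average({"amt": 200.0}, {"amt_mean_7d_1d": 75.5}))
print(is_higher_than_7d_average({"amt": 20.0}, {"amt_mean_7d_1d": 75.5}))
print(is_higher_than_7d_average({"amt": 20.0}, {"amt_mean_7d_1d": None}))
```

Treating a missing average as 0 means any positive amount from a user with no recent history is flagged as higher than average.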

In [None]:
from tecton import on_demand_feature_view
from tecton import RequestSource
from tecton.types import Float64, Field, Bool

# The feature view defined earlier in this notebook is the second source, so the
# function argument is named user_transaction_metrics_1 to match the snippet above
@on_demand_feature_view(
   #fill in with parameters described above

)
def transaction_amount_is_higher_than_7d_average(transaction_request, user_transaction_metrics_1):
    #definition goes here


transaction_amount_is_higher_than_7d_average.validate()

## Feature Services
A Feature Service references a set of features that are exposed as an API. It is generally recommended that each machine learning model has its own Feature Service. We will create a Feature Service with 3 of the Feature Views we just built:

* user_credit_card_issuer
* user_transaction_metrics_1[['amt_mean_7d_1d']]
* transaction_amount_is_higher_than_7d_average

In [None]:
from tecton import FeatureService

lab_fs = FeatureService(
  name = 'lab_fs',
  features = [
    #add features here

  ]
)

lab_fs.validate()

## Go to [lab.tecton.ai](https://lab.tecton.ai)
Tecton automatically builds on pre-existing features to materialize new ones as data arrives, orchestrates them together with the other feature views, and creates and runs the Spark and Python jobs needed to keep every feature view up to date.



## Feature Retrieval and Model Training
Now that all the features are defined in a Feature Service, we can generate data to send to machine learning models for training and serving.

get_historical_features() generates training examples from the offline store, as shown in the cell below, while a request to Tecton's HTTP API retrieves features from the online store for low-latency lookups at serving time.

We can retrieve a large data set with high throughput from the offline store to train a machine learning model.
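
When building training data, each event should only see feature values that were available at the event's timestamp. The sketch below illustrates that point-in-time lookup in plain Python, assuming this is the behaviour wanted from historical retrieval. The feature history, the helper `feature_as_of`, and the event times are hypothetical, and details such as `ttl` are ignored; it is not Tecton's implementation.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical daily feature values for one user: (effective timestamp, amt_mean_7d_1d)
feature_history = [
    (datetime(2023, 1, 2), 20.0),
    (datetime(2023, 1, 3), 35.0),
    (datetime(2023, 1, 4), 50.0),
]


def feature_as_of(history, event_time):
    """Return the latest feature value whose timestamp is <= event_time, or None."""
    timestamps = [ts for ts, _ in history]  # history must be sorted by timestamp
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx else None


events = [datetime(2023, 1, 1, 12), datetime(2023, 1, 3, 9), datetime(2023, 1, 10)]
for event_time in events:
    print(event_time, feature_as_of(feature_history, event_time))
```

An event that predates every feature value gets `None`, which is one reason the training cell below fills missing values before fitting a model.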

In [None]:
from tecton import FeatureService

training_events = (
    spark.read.parquet("s3://tecton.ai.public/tutorials/fraud_demo/transactions/")
    .select("user_id", "timestamp", "amt", "is_fraud")
    .limit(1000)
)

fraud_detection_feature_service = FeatureService(
    name="fraud_detection_feature_service", features=[user_transaction_metrics_1]
)
fraud_detection_feature_service.validate()

training_data = fraud_detection_feature_service.get_historical_features(training_events).to_pandas().fillna(0)
display(training_data.head(5))

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics


df = training_data.drop(["user_id", "timestamp", "amt"], axis=1)

X = df.drop("is_fraud", axis=1)
y = df["is_fraud"]
X = X.reindex(sorted(X.columns), axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

num_cols = X_train.select_dtypes(exclude=["object"]).columns.tolist()
cat_cols = X_train.select_dtypes(include=["object"]).columns.tolist()

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

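# OneHotEncoder's default (sparse) output is used here because the `sparse`
# keyword was removed in newer scikit-learn releases
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N/A"), OneHotEncoder(handle_unknown="ignore")
)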

full_pipe = ColumnTransformer([("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)])

model = make_pipeline(full_pipe, LogisticRegression(max_iter=1000, random_state=42))

model.fit(X_train, y_train)

y_predict = model.predict(X_test)

print(metrics.classification_report(y_test, y_predict, zero_division=0))


## Feature Retrieval and Real-Time Inference
Features are retrieved from the online store with HTTP requests for low latency. The cell below constructs an example request.

We can pull features from the online store and send them to a model in real time to predict whether a transaction is fraudulent before it completes.

In [None]:
import json

import requests


def get_online_feature_data(user_id, amt):
    # Authenticate with the same API key used to configure the SDK above
    headers = {"Authorization": "Tecton-key " + dbutils.secrets.get(scope='tecton-lab-2', key='TECTON_API_KEY')}

    # Build the payload as a dict so json.dumps handles quoting and escaping
    request_data = {
        "params": {
            "feature_service_name": "fraud_detection_feature_service",
            "join_key_map": {"user_id": str(user_id)},
            "metadata_options": {"include_names": True},
            "request_context_map": {"amt": amt},
            "workspace_name": "prod",
        }
    }

    response = requests.post(
        "https://lab.tecton.ai/api/v1/feature-service/get-features",
        headers=headers,
        data=json.dumps(request_data),
    )

    # Parse the JSON body of the response into a dict
    return response.json()

In [None]:
user_id = "user_502567604689"

feature_data = get_online_feature_data(user_id, 200)

if "result" not in feature_data:
    print("Feature data is not materialized")
else:
    print(feature_data["result"])
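
The response is easier to read when each value is paired with its feature name. The sketch below assumes the response has the same `metadata.features` and `result.features` fields that the next cell reads; the sample response, its feature names and values, and the helper `features_by_name` are hypothetical.

```python
# Hypothetical response, shaped like the fields this notebook reads
sample_response = {
    "result": {"features": [12.5, 30.0, 2, 7]},
    "metadata": {
        "features": [
            {"name": "user_transaction_metrics.amt_mean_1d_1d"},
            {"name": "user_transaction_metrics.amt_mean_7d_1d"},
            {"name": "user_transaction_metrics.transaction_count_1d_1d"},
            {"name": "user_transaction_metrics.transaction_count_7d_1d"},
        ]
    },
}


def features_by_name(response):
    """Zip feature names from the metadata with values from the result."""
    names = [f["name"] for f in response["metadata"]["features"]]
    values = response["result"]["features"]
    return dict(zip(names, values))


for name, value in features_by_name(sample_response).items():
    print(f"{name}: {value}")
```

The names are only returned because the request sets `include_names` in `metadata_options`.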

In [None]:
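import pandas as pd


def get_prediction_from_model(feature_data):
    # Skip the first two returned features and convert "view.feature" names to "view__feature"
    columns = [f["name"].replace(".", "__") for f in feature_data["metadata"]["features"][2:]]
    # Insert "_1" after the 24-character prefix "user_transaction_metrics" so the
    # names match the user_transaction_metrics_1 columns the model was trained on
    columns = [f[:24] + "_1" + f[24:] for f in columns]
    data = [feature_data["result"]["features"][2:]]

    features = pd.DataFrame(data, columns=columns)

    # predict() returns an array with one label for the single input row
    if model.predict(features)[0]:
        return "Transaction denied."
    else:
        return "Transaction accepted."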

In [None]:
user_id = "user_502567604689"

online_feature_data = get_online_feature_data(user_id, 20)
prediction = get_prediction_from_model(online_feature_data)

print(prediction)