# Lesson 1 - End-To-End Pipeline
This lesson will review the workflow of creating a feature view in Tecton, testing it, and pushing it to your Tecton instance when you're done.

In this lesson you will:
* Build an end-to-end feature pipeline using the core Tecton components:
  * Data Source
  * Entity
  * Feature View
  * Feature Service
* Query a Tecton feature service to extract features for both training and inference

### 🔑 Concept: 5 Ways to Interact with Tecton

There are 5 ways of interacting with Tecton:

1. **The Tecton Web UI:** The Tecton Web UI is where you can browse, discover, and monitor all of the data sources, features, and more that have been registered with the cluster using `tecton apply`. This is where you can discover other features in your organization that may be helpful for your model or check on the materialization statuses of new features.
2. **The Tecton SDK with Notebook Driven Development:** Tecton's SDK allows you to rapidly experiment with new feature views and data sources within a notebook environment. Tecton's SDK can be used in any EMR or Databricks notebook (like this one!) to fetch data from the Feature Store. This includes things like previewing feature data, testing transformations, and building training data sets. Currently the SDK requires a Spark Context, but soon we will offer support for using the Tecton SDK from a local notebook without Spark.
3. **The Tecton Feature Repo and CLI:** Data sources, features, and feature sets are all defined as Python configuration files in a local "Feature Repo" typically backed by git (such as the one you cloned earlier). These definitions are then applied to a workspace in a Tecton cluster using the CLI command `tecton apply`. This will be one of the most common CLI commands you will use.
4. **The Tecton REST API:** Tecton's REST API is used for fetching the latest feature values in production for model inference. This endpoint typically returns a feature vector in ~5 milliseconds.
5. **The Tecton API Client:** Tecton's Python and Java API clients make using the REST API easy.

In [None]:
!pip install --pre 'tecton[rift,snowflake]'

# 0. Initialize your session
## Logging into Tecton (and Snowflake)

In [None]:
import tecton
import logging
import os
import pandas as pd
import snowflake.connector
from datetime import datetime, timedelta
from pprint import pprint

connection_parameters = {
    "user": "YOUR_USER",
    "password": "YOUR_PASSWORD",
    "account": "tectonpartner",
    "warehouse": "NAB_WH",
    # Database and schema are required to create various temporary objects by tecton
    "database": "NAB_MFT_DB",
    "schema": "PUBLIC",
}
conn = snowflake.connector.connect(**connection_parameters)
tecton.snowflake_context.set_connection(conn) # Tecton will use this Snowflake connection for all interactive queries

In [None]:
tecton.login('https://demo-nebula.tecton.ai/')

# 1. Create your first feature pipeline

## A) Create Data Source

Data sources define a connection to a batch, stream, push, or request data source (i.e. request-time parameters) and are used as inputs to feature pipelines, known as "Feature Views" in Tecton.


You have 3 options when developing a [data source](https://docs.tecton.ai/docs/defining-features/data-sources):

  1. Create a data source with dummy data,
  2. Create a data source connected to [actual data (S3, Snowflake, etc.)](https://docs.tecton.ai/docs/setting-up-tecton/connecting-data-sources/connecting-data), or
  3. Reference and reuse a data source that has already been defined and registered to Tecton.

For this example, we will create a Snowflake data source to connect to a Snowflake Table, but see sample code for options 1 and 3 below:

**- Create a data source with dummy data:**
```python
from tecton import BatchSource, pandas_batch_config
from datetime import datetime 

sample_data = [{
    "user_id": "12345",
    "timestamp": datetime(2023, 3, 1),
    "amt": 100,
    "product": "A"
}, {
    "user_id": "12345",
    "timestamp": datetime(2023, 2, 1, 15),
    "amt": 200,
    "product": "B"
}, {
    "user_id": "12345",
    "timestamp": datetime(2023, 1, 1),
    "amt": 300,
    "product": "Q"
}, {
    "user_id": "54321",
    "timestamp": datetime(2023, 3, 1, 15),
    "amt": 200,
    "product": "C"
}, {
    "user_id": "54321",
    "timestamp":datetime(2023, 2, 1, 12),
    "amt": 200,
    "product": "D"
}]

@pandas_batch_config(supports_time_filtering=True)
def dummy_function(filter_context):
    import pandas as pd
    df = pd.DataFrame(sample_data)
    if filter_context:
        if filter_context.start_time:
            df = df[df["timestamp"] >= filter_context.start_time]
        if filter_context.end_time:
            df = df[df["timestamp"] < filter_context.end_time]
    return df

transactions_dummy = BatchSource(
    name='transactions_dummy',
    batch_config=dummy_function,
)
```
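
The time filtering in `dummy_function` keeps rows in a half-open interval: the start bound is inclusive and the end bound is exclusive. The sketch below reproduces that logic with the standard library only so it can be run without Tecton. `FakeFilterContext` and `filter_rows` are hypothetical names used for illustration and are not part of the Tecton SDK.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FakeFilterContext:
    """Hypothetical stand-in for the filter context passed to the batch config function."""
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None


def filter_rows(rows, filter_context):
    """Keep rows with start_time <= timestamp < end_time; a missing bound is left open."""
    if filter_context is None:
        return list(rows)
    kept = []
    for row in rows:
        ts = row["timestamp"]
        if filter_context.start_time and ts < filter_context.start_time:
            continue
        if filter_context.end_time and ts >= filter_context.end_time:
            continue
        kept.append(row)
    return kept


rows = [
    {"user_id": "12345", "timestamp": datetime(2023, 1, 1), "amt": 300},
    {"user_id": "12345", "timestamp": datetime(2023, 2, 1, 15), "amt": 200},
    {"user_id": "12345", "timestamp": datetime(2023, 3, 1), "amt": 100},
]

# The row stamped exactly at end_time is excluded; the one at start_time is kept.
window = FakeFilterContext(start_time=datetime(2023, 1, 1), end_time=datetime(2023, 3, 1))
kept_amounts = [r["amt"] for r in filter_rows(rows, window)]
print(kept_amounts)  # [300, 200]
```

Using a half-open interval means consecutive windows never count the same row twice.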

**- Reference an already created data source:**
```python
ws = tecton.get_current_workspace()
transactions_batch = ws.get_data_source("transactions_batch")
```

#### 1. Create a data source from a Snowflake table
Table name: **NAB_MFT_DB.PUBLIC.TRANSACTIONS**


In [None]:
from tecton import SnowflakeConfig, BatchSource

snowflake_config = SnowflakeConfig(
    url="https://tectonpartner.snowflakecomputing.com/",
    database="NAB_MFT_DB",
    schema="PUBLIC",
    warehouse="NAB_WH",
    table="TRANSACTIONS",
    timestamp_field="TIMESTAMP"
)

transactions = BatchSource(
    name='transactions',
    batch_config=snowflake_config,
)

transactions.validate()

#### 2. Inspect data source using `.get_dataframe()`
Once a Tecton data source is defined, you can read data from the data source into a pandas DataFrame using `.get_dataframe().to_pandas()`.
This is typically helpful when developing and testing data sources.

In [None]:
from datetime import datetime 

transactions.get_dataframe().to_pandas().head()

### Review Questions:
* What is a Tecton [data source](https://docs.tecton.ai/docs/defining-features/data-sources)?  
* Where is the Tecton documentation found?  
* Where is the documentation of how to read from various sources of data in Tecton?  
* (Advanced) How can you read from data in a format or location not directly supported by a Tecton FileConfig?  
* (Advanced) What are the 4 types of Tecton DataSources?  

## B) Create the entity and feature view logic

Now that we have our data source created, we can create our feature view, which contains both the transformation logic we want to run on our data source and the orchestration parameters.

An **entity** tells Tecton what the join key is, in this case the `USER_ID` column. Tecton will check that the join key column(s) exist after running our transformation logic, so we need to make sure we return a `USER_ID` column from our feature view.

There are two ways to reference an entity. We will be manually creating a new entity, but you can also pull already created entities from the Tecton cluster:

```python
import tecton
ws = tecton.get_workspace('prod')
user = ws.get_entity('fraud_user')
```

#### Creating a local Entity

In [None]:
from tecton import Entity

user = Entity(
    name='fraud_user',
    join_keys=['USER_ID']
)

Feature Views take in data sources as inputs, or in some cases other Feature Views, and define a transformation to compute one or more features. Feature Views also provide Tecton with additional information such as metadata and orchestration, serving, and monitoring configurations. There are three types of Feature Views, each designed to support a common data flow pattern **(Batch, Streaming, On-Demand)**.

**Feature Views** are defined by adding a decorator (e.g. **@batch_feature_view**) on top of a Python function. We specify the:
* Data source,
* Entity,
* Transformation (defined inline or reused from other feature views),
* Configuration parameters controlling where and how frequently Tecton materializes the data into the feature store.

In [None]:
from tecton import Entity, BatchSource, FileConfig, batch_feature_view, Aggregation
from tecton.types import Field, String, Timestamp, Float64
from datetime import datetime, timedelta

@batch_feature_view(
    description="User transaction metrics over 1, 3 and 7 days",
    sources=[transactions],
    entities=[user],
    mode="pandas",
    aggregation_interval=timedelta(days=1),
    aggregations=[
        Aggregation(function="mean", column="AMT", time_window=timedelta(days=1)),
        Aggregation(function="mean", column="AMT", time_window=timedelta(days=3)),
        Aggregation(function="mean", column="AMT", time_window=timedelta(days=7)),
        Aggregation(function="count", column="AMT", time_window=timedelta(days=1)),
        Aggregation(function="count", column="AMT", time_window=timedelta(days=3)),
        Aggregation(function="count", column="AMT", time_window=timedelta(days=7)),
    ],
    schema=[Field("USER_ID", String), Field("TIMESTAMP", Timestamp), Field("AMT", Float64)],
)
def user_transaction_metrics(transactions):
    return transactions[["USER_ID", "TIMESTAMP", "AMT"]]


# After we define local objects, we use `.validate()` to check the correctness of the definition and make it ready to query
user_transaction_metrics.validate()
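
The `aggregations` in the cell above request a trailing mean and count of `AMT` over 1, 3, and 7 days. As a rough illustration of what such a trailing-window statistic means, the sketch below computes it in plain Python for a handful of made-up events. It is an assumption-laden simplification: it ignores how Tecton aligns windows to the `aggregation_interval`, and `trailing_mean_and_count` is a hypothetical helper, not a Tecton function.

```python
from datetime import datetime, timedelta


def trailing_mean_and_count(events, as_of, window):
    """Mean and count of amounts with as_of - window <= timestamp < as_of."""
    amounts = [amt for ts, amt in events if as_of - window <= ts < as_of]
    count = len(amounts)
    mean = sum(amounts) / count if count else None
    return mean, count


# Made-up (timestamp, amount) events for a single user.
events = [
    (datetime(2023, 1, 1, 9), 10.0),
    (datetime(2023, 1, 3, 9), 20.0),
    (datetime(2023, 1, 7, 9), 60.0),
]
as_of = datetime(2023, 1, 8)

for days in (1, 3, 7):
    mean, count = trailing_mean_and_count(events, as_of, timedelta(days=days))
    print(f"{days}d window: mean={mean}, count={count}")
```

With these events, only the 7-day window sees all three transactions; the 1-day and 3-day windows see just the most recent one.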

### Test the feature view

Now that our Feature View is defined, we can test it locally in two ways:
- Compute feature values for a given range, defined by a ```start_time``` and ```end_time```
- Compute feature values based on a set of training events (join keys + timestamps), also called a spine dataframe

Here, we'll simply compute feature values for a five-month range.

In [None]:
start = datetime(2023,1,1)
end = datetime(2023,6,1)

tdf = user_transaction_metrics.get_historical_features(start_time=start, end_time=end)
type(tdf)

In [None]:
# We can convert this Tecton DataFrame into a Pandas dataframe (or Spark dataframe)
tdf.to_pandas().head()

### Review Questions:
* What is a Tecton [entity](https://docs.tecton.ai/docs/defining-features/entities)?  
* What is a Tecton [feature view](https://docs.tecton.ai/docs/defining-features/feature-views)?  
* What is a Tecton [dataframe](https://docs.tecton.ai/docs/sdk-reference/interacting-with-the-feature-store/tecton.TectonDataFrame)?  

## C) Create a Feature Service

A **Feature Service** represents the features that are needed by a model (or group of models). A feature service traditionally sources features from multiple feature views.

You can create and test a feature service by:

1. Adding a feature view to an existing feature service
2. Creating a new feature service

Here we are locally creating a new version of the feature_service and adding a new feature to it.
```python
from tecton import FeatureService

existing_feature_service = ws.get_feature_service("fraud_service")
feature_list = existing_feature_service.get_features_list()

fraud_service_v2 = FeatureService(
    name="fraud_service:v2",
    features=feature_list + [user_transaction_metrics]  # add the new feature view to the features list
)
```

In [None]:
from tecton import FeatureService


fraud_detection_feature_service = FeatureService(
    name="fraud_detection_feature_service", features=[user_transaction_metrics]
)

fraud_detection_feature_service.validate()

### Generate training data from our service
We'll build our training dataset from labeled historical transactions and try to predict the "is_fraud" column for a given transaction.

Let's load up some raw training events.

In [None]:
def query_snowflake(query):
    df = conn.cursor().execute(query).fetch_pandas_all()
    return df

training_events = query_snowflake("""
    select 
        USER_ID,
        TIMESTAMP,
        AMT,
        IS_FRAUD
    from
        TRANSACTIONS
""")

display(training_events.head(5))

Once these labelled training events have been loaded into a pandas DataFrame, we want to enrich them with features from Tecton in a point-in-time accurate way.
We can pass our ```training_events``` as an argument to our ```.get_historical_features()``` function. For each entity ID and timestamp in our training events DataFrame, Tecton will compute or read values from all Feature Views in our Feature Service, join them based on timestamp, and return an enriched DataFrame.
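
To make "point-in-time accurate" concrete, the sketch below shows the idea in plain Python: for each training event, look up the most recent feature value whose timestamp is not later than the event, so no information from the future leaks into training. The data and the `point_in_time_lookup` helper are hypothetical; this illustrates the join semantics, not how Tecton implements them.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(feature_rows, event_time):
    """Return the latest feature value with timestamp <= event_time, or None.

    feature_rows must be a list of (timestamp, value) sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, event_time)
    return feature_rows[idx - 1][1] if idx else None


# Made-up feature history per user, sorted by timestamp.
feature_history = {
    "user_1": [
        (datetime(2023, 1, 1), 10.0),
        (datetime(2023, 1, 2), 15.0),
        (datetime(2023, 1, 3), 40.0),
    ],
}

# Made-up spine: (join key, event timestamp).
sample_events = [
    ("user_1", datetime(2023, 1, 2, 12)),
    ("user_1", datetime(2022, 12, 31)),
]

enriched = [
    (uid, ts, point_in_time_lookup(feature_history.get(uid, []), ts))
    for uid, ts in sample_events
]
for row in enriched:
    print(row)
```

The first event picks up the value from January 2 rather than the later value from January 3, and the second event gets no value because no feature data existed yet.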

In [None]:
training_data = fraud_detection_feature_service.get_historical_features(training_events).to_pandas().fillna(0)
display(training_data.head(5))

## D) Train a Model


In [None]:
!pip install scikit-learn

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics


df = training_data.drop(["USER_ID", "TIMESTAMP", "AMT"], axis=1)

X = df.drop("IS_FRAUD", axis=1)
y = df["IS_FRAUD"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

num_cols = X_train.select_dtypes(exclude=["object"]).columns.tolist()
cat_cols = X_train.select_dtypes(include=["object"]).columns.tolist()

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N/A"), OneHotEncoder(handle_unknown="ignore", sparse_output=False)
)

full_pipe = ColumnTransformer([("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)])

model = make_pipeline(full_pipe, LogisticRegression(max_iter=1000, random_state=42))

model.fit(X_train, y_train)

y_predict = model.predict(X_test)

print(metrics.classification_report(y_test, y_predict, zero_division=0))

# 2. Deploy features to Tecton

Tecton objects get registered via a declarative workflow. Features are defined as code in a repo and applied to a workspace in a Tecton account using the Tecton CLI. This approach enables productionisation best practices such as "features as code," CI/CD, and unit testing.



### 1. Create Tecton repository

Let's switch over from our notebook to a terminal and create a new Tecton Feature Repository. For now we will put all our definitions in a single file.

✅ Run these commands to create a new Tecton repo.

```
mkdir tecton-feature-repo
cd tecton-feature-repo
touch features.py
tecton init
```

### 2. Copy definitions from the notebook and enable materialization
✅ Now copy & paste the definition of the Tecton objects you created in your notebook to ```features.py``` (copied below).

On our Feature View we've added four parameters to enable backfilling and ongoing materialization to the online and offline Feature Store:

```
online=True,
offline=True,
feature_start_time=datetime(2023,1,1),
batch_schedule=timedelta(days=1)
```
When we apply our changes to a Live Workspace, Tecton will automatically kick off jobs to backfill feature data from ```feature_start_time```. Frontfill jobs will then run on the defined ```batch_schedule```.
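
As a rough mental model of how `feature_start_time` and `batch_schedule` interact, the sketch below splits the time since `feature_start_time` into consecutive windows of one `batch_schedule` each. This is an assumption for illustration only: `materialization_windows` is a hypothetical helper, and Tecton may group several such periods into a single backfill job, as the sample `tecton apply` output later in this lesson suggests.

```python
from datetime import datetime, timedelta


def materialization_windows(feature_start_time, now, batch_schedule):
    """Split [feature_start_time, now) into complete batch_schedule-sized windows."""
    windows = []
    start = feature_start_time
    # Only emit windows that have fully elapsed; a partial window waits for the next run.
    while start + batch_schedule <= now:
        windows.append((start, start + batch_schedule))
        start += batch_schedule
    return windows


job_windows = materialization_windows(
    feature_start_time=datetime(2023, 1, 1),
    now=datetime(2023, 1, 4, 6),
    batch_schedule=timedelta(days=1),
)
for window_start, window_end in job_windows:
    print(window_start, "->", window_end)
```

Here three full daily windows have elapsed; the six hours of January 4 are not processed until that day is complete.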

```python
from tecton import Entity, BatchSource, SnowflakeConfig, batch_feature_view, Aggregation, DeltaConfig, FeatureService
from tecton.types import Field, String, Timestamp, Float64, Int64
from datetime import datetime, timedelta

snowflake_config = SnowflakeConfig(
    url="https://tectonpartner.snowflakecomputing.com/",
    database="NAB_MFT_DB",
    schema="PUBLIC",
    warehouse="NAB_WH",
    table="TRANSACTIONS",
    timestamp_field="TIMESTAMP"
)

transactions = BatchSource(
    name='transactions',
    batch_config=snowflake_config,
)

# An entity defines the concept we are modeling features for
# The join keys will be used to aggregate, join, and retrieve features
user = Entity(name="user", join_keys=["USER_ID"])

# We use pandas to select the raw columns and Tecton aggregations to efficiently and accurately compute metrics across raw events
# Feature View decorators contain a wide range of parameters for materializing, cataloging, and monitoring features
@batch_feature_view(
    description="User transaction metrics over 1, 3 and 7 days",
    sources=[transactions],
    entities=[user],
    mode="pandas",
    aggregation_interval=timedelta(days=1),
    aggregations=[
        Aggregation(function="mean", column="AMT", time_window=timedelta(days=1)),
        Aggregation(function="mean", column="AMT", time_window=timedelta(days=3)),
        Aggregation(function="mean", column="AMT", time_window=timedelta(days=7)),
        Aggregation(function="count", column="AMT", time_window=timedelta(days=1)),
        Aggregation(function="count", column="AMT", time_window=timedelta(days=3)),
        Aggregation(function="count", column="AMT", time_window=timedelta(days=7)),
    ],
    schema=[Field("USER_ID", String), Field("TIMESTAMP", Timestamp), Field("AMT", Float64)],
    offline_store=DeltaConfig(),
    online=True,
    offline=True,
    feature_start_time=datetime(2023, 1, 1),
    batch_schedule=timedelta(days=1)
)
def user_transaction_metrics(transactions):
    return transactions[["USER_ID", "TIMESTAMP", "AMT"]]


fraud_detection_feature_service = FeatureService(
    name="fraud_detection_feature_service", features=[user_transaction_metrics]
)

```

### 3. Apply your changes to a new workspace

Our last step is to log in to your organization's Tecton account and apply our repo to a workspace!

✅ Run the following commands in your terminal to create a workspace and apply your changes:

```
tecton login https://demo-nebula.tecton.ai/
tecton workspace create [your-name]-quickstart --live
tecton apply
```

The output of ```tecton apply``` will look like this:
```
Using workspace "[your-name]-quickstart" on cluster https://app.tecton.ai
✅ Imported 1 Python module from the feature repository
✅ Imported 1 Python module from the feature repository
⚠️  Running Tests: No tests found.
✅ Collecting local feature declarations
✅ Performing server-side feature validation: Initializing.
 ↓↓↓↓↓↓↓↓↓↓↓↓ Plan Start ↓↓↓↓↓↓↓↓↓↓

  + Create Batch Data Source
    name:           transactions

  + Create Entity
    name:           user

  + Create Transformation
    name:           user_transaction_metrics
    description:    Trailing average transaction amount over 1, 3 and 7 days

  + Create Batch Feature View
    name:           user_transaction_metrics
    description:    Trailing average transaction amount over 1, 3 and 7 days
    materialization: 11 backfills, 1 recurring batch job
    > backfill:     10 Backfill jobs 2021-12-25 00:00:00 UTC to 2023-08-16 00:00:00 UTC writing to the Offline Store
                    1 Backfill job 2023-08-16 00:00:00 UTC to 2023-08-23 00:00:00 UTC writing to both the Online and Offline Store
    > incremental:  1 Recurring Batch job scheduled every 1 day writing to both the Online and Offline Store

  + Create Feature Service
    name:           fraud_detection_feature_service

 ↑↑↑↑↑↑↑↑↑↑↑↑ Plan End ↑↑↑↑↑↑↑↑↑↑↑↑
 Generated plan ID is 8d01ad78e3194a5dbd3f934f04d71564
 View your plan in the Web UI: https://app.tecton.ai/app/[your-name]-quickstart/plan-summary/8d01ad78e3194a5dbd3f934f04d71564
 ⚠️  Objects in plan contain warnings.

Note: Updates to Feature Services may take up to 60 seconds to be propagated to the real-time feature-serving endpoint.
Note: This workspace ([your-name]-quickstart) is a "Live" workspace. Applying this plan may result in new materialization jobs which will incur costs. Carefully examine the plan output before applying changes.
Are you sure you want to apply this plan to: "[your-name]-quickstart"? [y/N]> y
🎉 all done!
```

# 3. Deploy Model and read features from Tecton in Real-time

### A) Retrieve features at low-latency

Our real-time fraud detection model needs access to feature data within a very small latency budget. Typically the end-to-end budget for feature retrieval plus model prediction is around 100 ms.

Once features are deployed to Tecton, they can be consumed in real-time in a variety of ways: 
- [REST API](https://docs.tecton.ai/docs/reading-feature-data/reading-feature-data-for-inference/reading-online-features-for-inference-using-the-http-api)
- [Python Client](https://docs.tecton.ai/docs/reading-feature-data/reading-feature-data-for-inference/reading-online-features-for-inference-using-the-python-client)
- [Java Client](https://docs.tecton.ai/docs/reading-feature-data/reading-feature-data-for-inference/reading-online-features-for-inference-using-the-java-client)

Here, let's use Tecton's REST API to retrieve features at low latency.

To do this, you will first need to create a new Service Account and give it access to read features from your workspace. This Service Account will be used to authenticate your calls to Tecton.

Follow these commands in your terminal:

```
tecton service-account create --name "[your-name]-quickstart" --description "Quickstart service account"
tecton access-control assign-role -r consumer -w [your-name]-quickstart -s [service account id from last command]
```

Use the API key returned by the first command in the cell below, which defines a function to retrieve online feature data for a given user.

In [None]:
import requests, json


def get_online_feature_data(user_id):
    TECTON_API_KEY = "[your-api-key]"
    WORKSPACE_NAME = "[your-workspace-name]"

    headers = {"Authorization": "Tecton-key " + TECTON_API_KEY}

    request_data = f"""{{
        "params": {{
            "feature_service_name": "fraud_detection_feature_service",
            "join_key_map": {{"USER_ID": "{user_id}"}},
            "metadata_options": {{"include_names": true}},
            "workspace_name": "{WORKSPACE_NAME}"
        }}
    }}"""

    online_feature_data = requests.request(
        method="POST",
        headers=headers,
        url="https://demo-nebula.tecton.ai/api/v1/feature-service/get-features",
        data=request_data,
    )

    online_feature_data_json = json.loads(online_feature_data.text)

    return online_feature_data_json
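
The request body above is assembled with an f-string, which can produce invalid JSON if a join key contains a quote or backslash. One possible alternative, sketched below, builds the same structure as a Python dict and serializes it with `json.dumps`. The helper name `build_get_features_request` is hypothetical, and the sketch makes no network call.

```python
import json


def build_get_features_request(api_key, workspace_name, feature_service_name, join_key_map):
    """Return (headers, body) for a get-features call, mirroring the fields used above."""
    headers = {"Authorization": "Tecton-key " + api_key}
    body = {
        "params": {
            "feature_service_name": feature_service_name,
            "join_key_map": join_key_map,
            "metadata_options": {"include_names": True},
            "workspace_name": workspace_name,
        }
    }
    # json.dumps handles quoting and escaping of every value.
    return headers, json.dumps(body)


sample_headers, sample_body = build_get_features_request(
    api_key="[your-api-key]",
    workspace_name="[your-workspace-name]",
    feature_service_name="fraud_detection_feature_service",
    join_key_map={"USER_ID": 'user_"502567604689'},  # deliberately awkward key
)
print(sample_body)
```

The resulting string can be passed as `data=` to `requests.request` in place of the hand-written JSON.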

Now we can use our function to retrieve features at low latency!

In [None]:
user_id = "user_502567604689"

feature_data = get_online_feature_data(user_id)

if "result" not in feature_data:
    print("Feature data is not materialized")
else:
    print(feature_data["result"])
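
When `include_names` is set, the response carries feature names under `metadata` and values under `result`, in matching order. The sketch below pairs them into a dictionary for easier inspection. The sample response and its feature names are made up for illustration, and `features_by_name` is a hypothetical helper; only the `metadata.features[].name` and `result.features` fields used elsewhere in this notebook are assumed.

```python
def features_by_name(response):
    """Pair each feature name with its value, assuming both lists share the same order."""
    names = [f["name"] for f in response["metadata"]["features"]]
    values = response["result"]["features"]
    if len(names) != len(values):
        raise ValueError("feature names and values have different lengths")
    return dict(zip(names, values))


# Made-up response; real feature names depend on your feature view definition.
sample_response = {
    "result": {"features": [3, 12.5]},
    "metadata": {
        "features": [
            {"name": "user_transaction_metrics.example_count"},
            {"name": "user_transaction_metrics.example_mean"},
        ]
    },
}

named_features = features_by_name(sample_response)
print(named_features)
```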

### B) Make model predictions

Now that we can fetch feature data online, let's create a function that takes a feature vector and runs model inference to get a fraud prediction.

💡**INFO**
Typically you'd instead use a model serving API that is hosting your model. Here we run inference directly in our notebook for simplicity.

In [None]:
import pandas as pd


def get_prediction_from_model(feature_data):
    columns = [f["name"].replace(".", "__") for f in feature_data["metadata"]["features"]]
    data = [feature_data["result"]["features"]]

    features = pd.DataFrame(data, columns=columns)

    return model.predict(features)[0]

**Let's put it all together and run inference!**

We can fetch our online features from Tecton, call our inference function, and get a prediction.

In [None]:
user_id = "user_502567604689"

online_feature_data = get_online_feature_data(user_id)
prediction = get_prediction_from_model(online_feature_data)

print(prediction)