# #5 Discovering Butterfree - Interval Runs

Welcome to the Discovering Butterfree tutorial series!

This is the fifth tutorial of this series: its goal is to cover interval runs.

Before diving into the tutorial, make sure you have a basic understanding of these main data concepts: features, feature sets and the "Feature Store Architecture". You can read more about this [here].

## Example:

We will simulate the following scenario (the same as in previous tutorials):

- We want to create a feature set with features about houses for rent (listings).


We have an input dataset:

- Table: `listing_events`. Table with data about events of house listings.


Our desire is to have three resulting datasets with the following schema:

* id: **int**;
* timestamp: **timestamp**;
* rent__avg_over_1_day_rolling_windows: **double**;
* rent__stddev_pop_over_1_day_rolling_windows: **double**.
 
The first dataset will be computed with just an end date time limit. The second one, on the other hand, uses both start and end date in order to filter data. Finally, the third one will be the result of a daily run. You can understand more about these definitions in our documentation.
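
To make the three modes concrete, here is a minimal sketch in plain Python. It is not Butterfree's implementation: the `select_interval` helper, the sample rows and the inclusive bounds are assumptions made for illustration, and the sketch ignores the empty boundary rows that appear in the outputs later in this tutorial.

```python
from datetime import date
from typing import List, Optional, Tuple

Row = Tuple[int, date]  # (id, timestamp as a date)


def select_interval(
    rows: List[Row], end_date: date, start_date: Optional[date] = None
) -> List[Row]:
    """Keep rows whose date falls in [start_date, end_date] (hypothetical helper)."""
    return [
        row
        for row in rows
        if row[1] <= end_date and (start_date is None or row[1] >= start_date)
    ]


# Sample rows made up for this sketch.
rows = [
    (1, date(2020, 5, 1)),
    (1, date(2020, 5, 2)),
    (1, date(2020, 5, 6)),
    (3, date(2020, 5, 21)),
    (5, date(2020, 5, 22)),
]

# 1) Only an end date: everything up to (and including) that day.
end_only = select_interval(rows, end_date=date(2020, 5, 30))

# 2) Start and end date: a time slice of the first result.
sliced = select_interval(rows, end_date=date(2020, 5, 21), start_date=date(2020, 5, 6))

# 3) Daily run: assumed here to behave like an interval of a single day.
daily = select_interval(rows, end_date=date(2020, 5, 21), start_date=date(2020, 5, 21))

print(len(end_only), sliced, daily)
```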

The following code blocks show how to generate this feature set using the Butterfree library:

In [1]:
# setup spark
from pyspark import SparkContext, SparkConf
from pyspark.sql import session

conf = SparkConf().setAll([('spark.driver.host','127.0.0.1'), ('spark.sql.session.timeZone', 'UTC')])
sc = SparkContext(conf=conf)
spark = session.SparkSession(sc)

In [2]:
# fix working dir
import pathlib
import os
path = os.path.join(pathlib.Path().absolute(), '../..')
os.chdir(path)

### Showing test data

In [3]:
listing_events_df = spark.read.json(f"{path}/examples/data/listing_events.json")
listing_events_df.createOrReplaceTempView("listing_events")  # creating listing_events view

region = spark.read.json(f"{path}/examples/data/region.json")

Listing events table:

In [4]:
listing_events_df.toPandas()

Unnamed: 0,area,bathrooms,bedrooms,id,region_id,rent,timestamp
0,50,1,1,1,1,1300,1588302000000
1,50,1,1,1,1,2000,1588647600000
2,100,1,2,2,2,1500,1588734000000
3,100,1,2,2,2,2500,1589252400000
4,150,2,2,3,3,3000,1589943600000
5,175,2,2,4,4,3200,1589943600000
6,250,3,3,5,5,3200,1590030000000
7,225,3,2,6,6,3200,1590116400000


Region table:

In [5]:
region.toPandas()

Unnamed: 0,city,id,lat,lng,region
0,Cerulean,1,73.44489,31.7503,Kanto
1,Veridian,2,-9.4351,-167.11772,Kanto
2,Cinnabar,3,29.73043,117.66164,Kanto
3,Pallet,4,-52.95717,-81.15251,Kanto
4,Violet,5,-47.35798,-178.77255,Johto
5,Olivine,6,51.7282,46.21958,Johto


### Extract

- For the extract part, we need the `Source` entity plus a `TableReader` and a `FileReader` for the data we have;
- We need to declare a query that combines the results from our readers (here, a simple join between the listing events and the regions).

In [6]:
from butterfree.clients import SparkClient
from butterfree.extract import Source
from butterfree.extract.readers import FileReader, TableReader
from butterfree.extract.pre_processing import filter

readers = [
    TableReader(id="listing_events", table="listing_events",),
    FileReader(id="region", path=f"{path}/examples/data/region.json", format="json",)
]

query = """
select
    listing_events.*,
    region.city,
    region.region,
    region.lat,
    region.lng,
    region.region as region_name
from
    listing_events
    join region
      on listing_events.region_id = region.id
"""

source = Source(readers=readers, query=query)
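
The join in this query can be tried outside Spark. The sketch below runs the same statement against an in-memory SQLite database, using two rows copied from the sample data and only a subset of the listing columns. SQLite stands in for Spark SQL purely for illustration; the table definitions are assumptions of this sketch, not part of Butterfree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name

# Minimal stand-ins for the two inputs (subset of columns and rows).
conn.executescript(
    """
    create table listing_events (id integer, region_id integer, rent integer, timestamp integer);
    create table region (id integer, city text, region text, lat real, lng real);
    insert into listing_events values (1, 1, 1300, 1588302000000);
    insert into listing_events values (5, 5, 3200, 1590030000000);
    insert into region values (1, 'Cerulean', 'Kanto', 73.44489, 31.7503);
    insert into region values (5, 'Violet', 'Johto', -47.35798, -178.77255);
    """
)

query = """
select
    listing_events.*,
    region.city,
    region.region,
    region.lat,
    region.lng,
    region.region as region_name
from
    listing_events
    join region
      on listing_events.region_id = region.id
"""

# Index the joined rows by listing id to inspect them easily.
by_id = {row["id"]: dict(row) for row in conn.execute(query)}
print(by_id[1]["city"], by_id[1]["region_name"])
print(by_id[5]["city"], by_id[5]["region_name"])
```

Each listing row is enriched with the city and region of its `region_id`, which is what makes the `region_name` column available to the transform step.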

In [7]:
spark_client = SparkClient()
source_df = source.construct(spark_client)

And, finally, it's possible to see the results from building our source dataset:

In [8]:
source_df.toPandas()

Unnamed: 0,area,bathrooms,bedrooms,id,region_id,rent,timestamp,city,region,lat,lng,region_name
0,50,1,1,1,1,1300,1588302000000,Cerulean,Kanto,73.44489,31.7503,Kanto
1,50,1,1,1,1,2000,1588647600000,Cerulean,Kanto,73.44489,31.7503,Kanto
2,100,1,2,2,2,1500,1588734000000,Veridian,Kanto,-9.4351,-167.11772,Kanto
3,100,1,2,2,2,2500,1589252400000,Veridian,Kanto,-9.4351,-167.11772,Kanto
4,150,2,2,3,3,3000,1589943600000,Cinnabar,Kanto,29.73043,117.66164,Kanto
5,175,2,2,4,4,3200,1589943600000,Pallet,Kanto,-52.95717,-81.15251,Kanto
6,250,3,3,5,5,3200,1590030000000,Violet,Johto,-47.35798,-178.77255,Johto
7,225,3,2,6,6,3200,1590116400000,Olivine,Johto,51.7282,46.21958,Johto


### Transform
- At the transform part, a set of `Feature` objects is declared;
- An instance of `AggregatedFeatureSet` is used to hold the features;
- An `AggregatedFeatureSet` can only be created when it is possible to define a unique tuple formed by key columns and a time reference. This is an **architectural requirement** for the data. So at least one `KeyFeature` and one `TimestampFeature` are needed;
- Every `Feature` needs a unique name, a description, and a data-type definition. Besides, in the case of the `AggregatedFeatureSet`, it's also mandatory to have an `AggregatedTransform` operator;
- An `AggregatedTransform` operator is used, as the name suggests, to define aggregation functions.
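
As a rough illustration of what an average and a population standard deviation over a 1-day rolling window mean, here is a sketch in plain Python. It is not how Butterfree or Spark compute the features: the `rolling_stats` helper and the half-open window `[window_end - 1 day, window_end)` are assumptions. The two events are the ones for house `id = 1` in the sample data, with their millisecond timestamps already converted to UTC.

```python
from datetime import datetime, timedelta, timezone
from statistics import fmean, pstdev

# (timestamp, rent) events for house id 1, converted from epoch milliseconds.
events = [
    (datetime(2020, 5, 1, 3, tzinfo=timezone.utc), 1300),
    (datetime(2020, 5, 5, 3, tzinfo=timezone.utc), 2000),
]


def rolling_stats(events, window_end, window=timedelta(days=1)):
    """Return (avg, stddev_pop) of rents inside the window, or None if it is empty."""
    rents = [rent for ts, rent in events if window_end - window <= ts < window_end]
    if not rents:
        return None
    return fmean(rents), pstdev(rents)


for day in (2, 3, 6):
    window_end = datetime(2020, 5, day, tzinfo=timezone.utc)
    print(window_end.date(), rolling_stats(events, window_end))
```

A window with a single event has a standard deviation of `0.0`, and a window with no events yields no value at all.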

In [9]:
from pyspark.sql import functions as F

from butterfree.transform.aggregated_feature_set import AggregatedFeatureSet
from butterfree.transform.features import Feature, KeyFeature, TimestampFeature
from butterfree.transform.transformations import AggregatedTransform
from butterfree.constants import DataType
from butterfree.transform.utils import Function

keys = [
    KeyFeature(
        name="id",
        description="Unique identifier code for houses.",
        dtype=DataType.BIGINT,
    )
]

# from_ms = True because the data originally is not in a Timestamp format.
ts_feature = TimestampFeature(from_ms=True)

features = [
    Feature(
        name="rent",
        description="Rent value by month described in the listing.",
        transformation=AggregatedTransform(
            functions=[
                Function(F.avg, DataType.DOUBLE),
                Function(F.stddev_pop, DataType.DOUBLE),
            ],
            filter_expression="region_name = 'Kanto'",
        ),
    )
]

aggregated_feature_set = AggregatedFeatureSet(
    name="house_listings",
    entity="house",  # entity: to which "business context" this feature set belongs
    description="Features describing a house listing.",
    keys=keys,
    timestamp=ts_feature,
    features=features,
).with_windows(definitions=["1 day"])
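
The `from_ms=True` flag in the cell above is needed because the source `timestamp` column holds epoch milliseconds. The conversion itself can be sketched in plain Python; the `from_ms` helper below is hypothetical and only borrows the flag's name, it is not Butterfree code.

```python
from datetime import datetime, timezone


def from_ms(epoch_ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# First event of the sample data (house id 1, rent 1300).
first_event = from_ms(1588302000000)
print(first_event.isoformat())  # 2020-05-01T03:00:00+00:00
```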

Here, we'll construct our first aggregated feature set dataframe, with just an `end_date` parameter:

In [10]:
aggregated_feature_set_windows_df = aggregated_feature_set.construct(
    source_df, 
    spark_client, 
    end_date="2020-05-30"
)

The resulting dataset is:

In [11]:
aggregated_feature_set_windows_df.orderBy("id", "timestamp").toPandas()

Unnamed: 0,id,timestamp,rent__avg_over_1_day_rolling_windows,rent__stddev_pop_over_1_day_rolling_windows
0,1,2020-05-01,,
1,1,2020-05-02,1300.0,0.0
2,1,2020-05-03,,
3,1,2020-05-06,2000.0,0.0
4,1,2020-05-07,,
5,2,2020-05-01,,
6,2,2020-05-07,1500.0,0.0
7,2,2020-05-08,,
8,2,2020-05-13,2500.0,0.0
9,2,2020-05-14,,


If we pass both a `start_date` and an `end_date`, we get a time slice of the previous dataframe:

In [12]:
aggregated_feature_set.construct(
    source_df, 
    spark_client, 
    end_date="2020-05-21",
    start_date="2020-05-06",
).orderBy("id", "timestamp").toPandas()

Unnamed: 0,id,timestamp,rent__avg_over_1_day_rolling_windows,rent__stddev_pop_over_1_day_rolling_windows
0,1,2020-05-06,2000.0,0.0
1,1,2020-05-07,,
2,2,2020-05-06,,
3,2,2020-05-07,1500.0,0.0
4,2,2020-05-08,,
5,2,2020-05-13,2500.0,0.0
6,2,2020-05-14,,
7,3,2020-05-06,,
8,3,2020-05-21,3000.0,0.0
9,4,2020-05-06,,


### Load

- For the load part we need `Writer` instances and a `Sink`;
- `writers` define where to load the data;
- The `Sink` gets the transformed data (feature set) and triggers the load to all the defined `writers`;
- `debug_mode` will create a temporary view instead of trying to write in a real data store.

In [13]:
from butterfree.load.writers import (
    HistoricalFeatureStoreWriter,
    OnlineFeatureStoreWriter,
)
from butterfree.load import Sink

writers = [HistoricalFeatureStoreWriter(debug_mode=True, interval_mode=True), 
           OnlineFeatureStoreWriter(debug_mode=True, interval_mode=True)]
sink = Sink(writers=writers)

### Pipeline

- The `Pipeline` entity wraps all the other defined elements.
- The `run` command triggers the execution of the pipeline, end-to-end.

In [14]:
from butterfree.pipelines import FeatureSetPipeline

pipeline = FeatureSetPipeline(source=source, feature_set=aggregated_feature_set, sink=sink)

The first run will use just an `end_date` as a parameter:

In [15]:
result_df = pipeline.run(end_date="2020-05-30")

In [16]:
spark.table("historical_feature_store__house_listings").orderBy(
    "id", "timestamp"
).toPandas()

Unnamed: 0,id,timestamp,rent__avg_over_1_day_rolling_windows,rent__stddev_pop_over_1_day_rolling_windows,year,month,day
0,1,2020-05-01,,,2020,5,1
1,1,2020-05-02,1300.0,0.0,2020,5,2
2,1,2020-05-03,,,2020,5,3
3,1,2020-05-06,2000.0,0.0,2020,5,6
4,1,2020-05-07,,,2020,5,7
5,2,2020-05-01,,,2020,5,1
6,2,2020-05-07,1500.0,0.0,2020,5,7
7,2,2020-05-08,,,2020,5,8
8,2,2020-05-13,2500.0,0.0,2020,5,13
9,2,2020-05-14,,,2020,5,14


In [17]:
spark.table("online_feature_store__house_listings").orderBy("id", "timestamp").toPandas()

Unnamed: 0,id,timestamp,rent__avg_over_1_day_rolling_windows,rent__stddev_pop_over_1_day_rolling_windows
0,1,2020-05-07,,
1,2,2020-05-14,,
2,3,2020-05-22,,
3,4,2020-05-22,,
4,5,2020-05-01,,
5,6,2020-05-01,,


- We can see that we were able to create all the desired features in an easy way
- The **historical feature set** holds all the data, and we can see that it is partitioned by year, month and day (columns added in the `HistoricalFeatureStoreWriter`)
- In the **online feature set** there is only the latest data for each id
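
The difference between the two stores can be sketched in plain Python. This is only an illustration of the observations above, not the writers' actual code: the `to_historical` and `to_online` helpers and the shortened `rent_avg` column name are assumptions, while the sample rows follow the historical output shown above.

```python
from datetime import date

rows = [
    {"id": 1, "timestamp": date(2020, 5, 6), "rent_avg": 2000.0},
    {"id": 1, "timestamp": date(2020, 5, 7), "rent_avg": None},
    {"id": 2, "timestamp": date(2020, 5, 13), "rent_avg": 2500.0},
    {"id": 2, "timestamp": date(2020, 5, 14), "rent_avg": None},
]


def to_historical(rows):
    """Keep every row and add year/month/day partition columns."""
    return [
        {
            **row,
            "year": row["timestamp"].year,
            "month": row["timestamp"].month,
            "day": row["timestamp"].day,
        }
        for row in rows
    ]


def to_online(rows):
    """Keep only the most recent row for each id."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["id"]] = row
    return sorted(latest.values(), key=lambda row: row["id"])


historical = to_historical(rows)
online = to_online(rows)
print(len(historical), [(row["id"], row["timestamp"]) for row in online])
```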

The second run, on the other hand, will use both a `start_date` and `end_date` as parameters.

In [18]:
result_df = pipeline.run(end_date="2020-05-21", start_date="2020-05-06")

In [19]:
spark.table("historical_feature_store__house_listings").orderBy(
    "id", "timestamp"
).toPandas()

Unnamed: 0,id,timestamp,rent__avg_over_1_day_rolling_windows,rent__stddev_pop_over_1_day_rolling_windows,year,month,day
0,1,2020-05-06,2000.0,0.0,2020,5,6
1,1,2020-05-07,,,2020,5,7
2,2,2020-05-06,,,2020,5,6
3,2,2020-05-07,1500.0,0.0,2020,5,7
4,2,2020-05-08,,,2020,5,8
5,2,2020-05-13,2500.0,0.0,2020,5,13
6,2,2020-05-14,,,2020,5,14
7,3,2020-05-06,,,2020,5,6
8,3,2020-05-21,3000.0,0.0,2020,5,21
9,4,2020-05-06,,,2020,5,6


In [20]:
spark.table("online_feature_store__house_listings").orderBy("id", "timestamp").toPandas()

Unnamed: 0,id,timestamp,rent__avg_over_1_day_rolling_windows,rent__stddev_pop_over_1_day_rolling_windows
0,1,2020-05-07,,
1,2,2020-05-14,,
2,3,2020-05-21,3000.0,0.0
3,4,2020-05-21,3200.0,0.0
4,5,2020-05-06,,
5,6,2020-05-06,,


Finally, the third run will use only an `execution_date` as a parameter.

In [21]:
result_df = pipeline.run_for_date(execution_date="2020-05-21")

In [22]:
spark.table("historical_feature_store__house_listings").orderBy(
    "id", "timestamp"
).toPandas()

Unnamed: 0,id,timestamp,rent__avg_over_1_day_rolling_windows,rent__stddev_pop_over_1_day_rolling_windows,year,month,day
0,1,2020-05-21,,,2020,5,21
1,2,2020-05-21,,,2020,5,21
2,3,2020-05-21,3000.0,0.0,2020,5,21
3,4,2020-05-21,3200.0,0.0,2020,5,21
4,5,2020-05-21,,,2020,5,21
5,6,2020-05-21,,,2020,5,21


In [23]:
spark.table("online_feature_store__house_listings").orderBy("id", "timestamp").toPandas()

Unnamed: 0,id,timestamp,rent__avg_over_1_day_rolling_windows,rent__stddev_pop_over_1_day_rolling_windows
0,1,2020-05-21,,
1,2,2020-05-21,,
2,3,2020-05-21,3000.0,0.0
3,4,2020-05-21,3200.0,0.0
4,5,2020-05-21,,
5,6,2020-05-21,,
