![](https://66.media.tumblr.com/tumblr_macyx8VqU11rfjowdo1_500.gif)


# #3 Discovering Butterfree - Aggregated Feature Set

Welcome to the Discovering Butterfree tutorial series!

This is the third tutorial of this series: its goal is to cover aggregated feature sets.

Before diving into the tutorial, make sure you have a basic understanding of these main data concepts: features, feature sets and the "Feature Store Architecture". You can read more about them [here].

## Example:

We will simulate the following scenario (the same as in the previous tutorial):

- We want to create a feature set with features about houses for rent (listings).


We have an input dataset:

- Table: `listing_events`. Table with data about events of house listings.


We want to produce two resulting datasets with the following schemas:

* id: **int**;
* timestamp: **timestamp**;
* rent__avg: **double**;
* rent__stddev_pop: **double**.
 
and

* id: **int**;
* timestamp: **timestamp**;
* rent__avg_over_1_day_rolling_windows: **double**;
* rent__stddev_pop_over_1_day_rolling_windows: **double**.

The first dataset will be computed with two simple aggregations: average and standard deviation. The second one, on the other hand, uses both these simple aggregations and a time window (defined as one day). You can understand more about these definitions in our Wiki.

The following code blocks show how to generate this feature set using the Butterfree library:

In [1]:
# setup spark
from pyspark import SparkContext, SparkConf
from pyspark.sql import session

conf = SparkConf().setAll([('spark.driver.host','127.0.0.1'), ('spark.sql.session.timeZone', 'UTC')])
sc = SparkContext(conf=conf)
spark = session.SparkSession(sc)

In [2]:
# fix working dir
import pathlib
import os
path = os.path.join(pathlib.Path().absolute(), '../..')
os.chdir(path)

### Showing test data

In [3]:
listing_events_df = spark.read.json(f"{path}/examples/data/listing_events.json")
listing_events_df.createOrReplaceTempView("listing_events")  # creating listing_events view

Listing events table:

In [4]:
listing_events_df.toPandas()

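|   | area | bathrooms | bedrooms | id | region_id | rent | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 50 | 1 | 1 | 1 | 1 | 1300 | 1588302000000 |
| 1 | 50 | 1 | 1 | 1 | 1 | 2000 | 1588647600000 |
| 2 | 100 | 1 | 2 | 2 | 2 | 1500 | 1588734000000 |
| 3 | 100 | 1 | 2 | 2 | 2 | 2500 | 1589252400000 |
| 4 | 150 | 2 | 2 | 3 | 3 | 3000 | 1589943600000 |
| 5 | 175 | 2 | 2 | 4 | 4 | 3200 | 1589943600000 |
| 6 | 250 | 3 | 3 | 5 | 5 | 3200 | 1590030000000 |
| 7 | 225 | 3 | 2 | 6 | 6 | 3200 | 1590116400000 |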


### Extract

- For the extract part, we need the `Source` entity and a `TableReader` for the data we have;
- We need to declare a query to bring the results from our single reader (it's as simple as a select all statement).

In [5]:
from butterfree.core.clients import SparkClient
from butterfree.core.extract import Source
from butterfree.core.extract.readers import TableReader

readers = [
    TableReader(id="listing_events", table="listing_events"),
]

query = """
select
    *
from
    listing_events
"""

source = Source(readers=readers, query=query)

In [6]:
spark_client = SparkClient()
source_df = source.construct(spark_client)

Finally, we can see the result of building our source dataset:

In [7]:
source_df.toPandas()

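|   | area | bathrooms | bedrooms | id | region_id | rent | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 50 | 1 | 1 | 1 | 1 | 1300 | 1588302000000 |
| 1 | 50 | 1 | 1 | 1 | 1 | 2000 | 1588647600000 |
| 2 | 100 | 1 | 2 | 2 | 2 | 1500 | 1588734000000 |
| 3 | 100 | 1 | 2 | 2 | 2 | 2500 | 1589252400000 |
| 4 | 150 | 2 | 2 | 3 | 3 | 3000 | 1589943600000 |
| 5 | 175 | 2 | 2 | 4 | 4 | 3200 | 1589943600000 |
| 6 | 250 | 3 | 3 | 5 | 5 | 3200 | 1590030000000 |
| 7 | 225 | 3 | 2 | 6 | 6 | 3200 | 1590116400000 |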

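The `timestamp` values above are large integers, which is consistent with epoch milliseconds (the transform step in this tutorial sets `from_ms=True` for this reason). As a quick sanity check outside Spark, the standard-library sketch below converts the first value to a UTC datetime. It is only an illustration under that assumption, not a description of how Butterfree performs the conversion.

```python
from datetime import datetime, timezone

# First `timestamp` value from the table above, assumed to be epoch milliseconds.
timestamp_ms = 1588302000000

# Divide by 1000 to get seconds, then build a timezone-aware UTC datetime.
event_time = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)

print(event_time.isoformat())  # 2020-05-01T03:00:00+00:00
```

This matches the `2020-05-01 03:00:00` timestamp that appears for `id` 1 in the aggregated output later in this tutorial.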

### Transform
- In the transform part, a set of `Feature` objects is declared;
- An instance of `AggregatedFeatureSet` is used to hold the features;
- An `AggregatedFeatureSet` can only be created when it is possible to define a unique tuple formed by key columns and a time reference. This is an **architectural requirement** for the data, so at least one `KeyFeature` and one `TimestampFeature` are needed;
- Every `Feature` needs a unique name, a description, and a data-type definition. In the case of the `AggregatedFeatureSet`, it's also mandatory to have an `AggregatedTransform` operator;
- An `AggregatedTransform` operator is used, as the name suggests, to define aggregation functions.

In [8]:
from pyspark.sql import functions as F

from butterfree.core.transform.aggregated_feature_set import AggregatedFeatureSet
from butterfree.core.transform.features import Feature, KeyFeature, TimestampFeature
from butterfree.core.transform.transformations import AggregatedTransform
from butterfree.core.constants.data_type import DataType
from butterfree.core.transform.utils.function import Function

keys = [
    KeyFeature(
        name="id",
        description="Unique identifier code for houses.",
        dtype=DataType.BIGINT,
    )
]

# from_ms=True because the source timestamps are in milliseconds, not in a Timestamp format.
ts_feature = TimestampFeature(from_ms=True)

features = [
    Feature(
        name="rent",
        description="Rent value by month described in the listing.",
        transformation=AggregatedTransform(
             functions=[
                 Function(F.avg, DataType.DOUBLE),
                 Function(F.stddev_pop, DataType.DOUBLE),
             ],
        ),
    )
]

aggregated_feature_set = AggregatedFeatureSet(
    name="house_listings",
    entity="house",  # entity: to which "business context" this feature set belongs
    description="Features describing a house listing.",
    keys=keys,
    timestamp=ts_feature,
    features=features,
)

In [9]:
aggregated_feature_set_df = aggregated_feature_set.construct(source_df, spark_client)

The dataset resulting from running the transformations defined within the `AggregatedFeatureSet` is:

In [10]:
aggregated_feature_set_df.toPandas()

|   | id | timestamp | rent__avg | rent__stddev_pop |
|---|---|---|---|---|
| 0 | 6 | 2020-05-22 03:00:00 | 3200.0 | 0.0 |
| 1 | 5 | 2020-05-21 03:00:00 | 3200.0 | 0.0 |
| 2 | 1 | 2020-05-01 03:00:00 | 1300.0 | 0.0 |
| 3 | 1 | 2020-05-05 03:00:00 | 2000.0 | 0.0 |
| 4 | 3 | 2020-05-20 03:00:00 | 3000.0 | 0.0 |
| 5 | 2 | 2020-05-06 03:00:00 | 1500.0 | 0.0 |
| 6 | 2 | 2020-05-12 03:00:00 | 2500.0 | 0.0 |
| 7 | 4 | 2020-05-20 03:00:00 | 3200.0 | 0.0 |

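Every `rent__stddev_pop` value above is `0.0` and every `rent__avg` equals the original rent. That is what you would expect if the aggregation groups rows by key and timestamp, because each (`id`, `timestamp`) pair occurs only once in the input. The plain-Python sketch below reproduces this for `id` 1 and `id` 2. The grouping rule is an inference from the output, not something taken from Butterfree's source.

```python
from collections import defaultdict
from statistics import mean, pstdev

# (id, timestamp in epoch ms, rent) rows for ids 1 and 2, copied from the input table.
rows = [
    (1, 1588302000000, 1300),
    (1, 1588647600000, 2000),
    (2, 1588734000000, 1500),
    (2, 1589252400000, 2500),
]

# Group rents by (id, timestamp); this grouping key is an assumption.
groups = defaultdict(list)
for listing_id, ts, rent in rows:
    groups[(listing_id, ts)].append(rent)

# Average and population standard deviation per group.
aggregated = {key: (mean(rents), pstdev(rents)) for key, rents in groups.items()}

for key, (avg, std) in sorted(aggregated.items()):
    print(key, avg, std)
```

Grouping by `id` alone would instead give an average of 1650 and a population standard deviation of 350 for `id` 1, which is not what the table shows.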

Now let's define a window for our `AggregatedFeatureSet`:

In [11]:
aggregated_feature_set.with_windows(definitions=["1 day"])
aggregated_feature_set_windows_df = aggregated_feature_set.construct(
    source_df, 
    spark_client, 
    end_date="2020-05-30"
)

The resulting dataset is:

In [12]:
aggregated_feature_set_windows_df.orderBy("id", "timestamp").toPandas()

|   | id | timestamp | rent__avg_over_1_day_rolling_windows | rent__stddev_pop_over_1_day_rolling_windows |
|---|---|---|---|---|
| 0 | 1 | 2020-05-01 |  |  |
| 1 | 1 | 2020-05-02 | 1300.0 | 0.0 |
| 2 | 1 | 2020-05-03 |  |  |
| 3 | 1 | 2020-05-06 | 2000.0 | 0.0 |
| 4 | 1 | 2020-05-07 |  |  |
| 5 | 2 | 2020-05-01 |  |  |
| 6 | 2 | 2020-05-07 | 1500.0 | 0.0 |
| 7 | 2 | 2020-05-08 |  |  |
| 8 | 2 | 2020-05-13 | 2500.0 | 0.0 |
| 9 | 2 | 2020-05-14 |  |  |

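One way to read the table above is that the row for day D aggregates the events that happened in the 24 hours before D (from D minus one day, inclusive, up to D, exclusive). The sketch below applies that rule to the two events of `id` 1 using only the standard library. The window boundaries and the helper `rolling_1_day` are assumptions made for illustration, inferred from the output rather than from Butterfree's implementation.

```python
from datetime import date, datetime, timedelta, timezone
from statistics import mean, pstdev

# (timestamp, rent) events for listing id 1, taken from the tables in this tutorial.
events = [
    (datetime(2020, 5, 1, 3, 0, tzinfo=timezone.utc), 1300),
    (datetime(2020, 5, 5, 3, 0, tzinfo=timezone.utc), 2000),
]


def rolling_1_day(events, start, end):
    """For each day D in [start, end], aggregate events with D - 1 day <= ts < D."""
    rows = {}
    day = start
    while day <= end:
        window_end = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        window_start = window_end - timedelta(days=1)
        rents = [rent for ts, rent in events if window_start <= ts < window_end]
        # Days with no events in the window get empty values.
        rows[day] = (mean(rents), pstdev(rents)) if rents else (None, None)
        day += timedelta(days=1)
    return rows


windows = rolling_1_day(events, date(2020, 5, 1), date(2020, 5, 7))
for day, (avg, std) in windows.items():
    print(day.isoformat(), avg, std)
```

With these inputs, only 2020-05-02 and 2020-05-06 get values, matching the non-empty rows for `id` 1 in the table above.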

Using a different `end_date` value can produce different results:

In [13]:
aggregated_feature_set.construct(
    source_df, 
    spark_client, 
    end_date="2020-05-20"
).orderBy("id", "timestamp").toPandas()

|   | id | timestamp | rent__avg_over_1_day_rolling_windows | rent__stddev_pop_over_1_day_rolling_windows |
|---|---|---|---|---|
| 0 | 1 | 2020-05-01 |  |  |
| 1 | 1 | 2020-05-02 | 1300.0 | 0.0 |
| 2 | 1 | 2020-05-03 |  |  |
| 3 | 1 | 2020-05-06 | 2000.0 | 0.0 |
| 4 | 1 | 2020-05-07 |  |  |
| 5 | 2 | 2020-05-01 |  |  |
| 6 | 2 | 2020-05-07 | 1500.0 | 0.0 |
| 7 | 2 | 2020-05-08 |  |  |
| 8 | 2 | 2020-05-13 | 2500.0 | 0.0 |
| 9 | 2 | 2020-05-14 |  |  |


### Load

- For the load part we need `Writer` instances and a `Sink`;
- `writers` define where to load the data;
- The `Sink` gets the transformed data (feature set) and triggers the load to all the defined `writers`;
- `debug_mode` will create a temporary view instead of trying to write in a real data store.

In [14]:
from butterfree.core.load.writers import (
    HistoricalFeatureStoreWriter,
    OnlineFeatureStoreWriter,
)
from butterfree.core.load import Sink

writers = [HistoricalFeatureStoreWriter(debug_mode=True), OnlineFeatureStoreWriter(debug_mode=True)]
sink = Sink(writers=writers)

## Pipeline

- The `Pipeline` entity wraps all the other defined elements.
- The `run` command triggers the execution of the pipeline, end-to-end.

In [15]:
from butterfree.core.pipelines import FeatureSetPipeline

pipeline = FeatureSetPipeline(source=source, feature_set=aggregated_feature_set, sink=sink)

In [16]:
result_df = pipeline.run(end_date="2020-06-30")

### Showing the results

In [17]:
spark.table("historical_feature_store__house_listings").orderBy(
    "id", "timestamp"
).toPandas()

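|   | id | timestamp | rent__avg_over_1_day_rolling_windows | rent__stddev_pop_over_1_day_rolling_windows | year | month | day |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2020-05-01 |  |  | 2020 | 5 | 1 |
| 1 | 1 | 2020-05-02 | 1300.0 | 0.0 | 2020 | 5 | 2 |
| 2 | 1 | 2020-05-03 |  |  | 2020 | 5 | 3 |
| 3 | 1 | 2020-05-06 | 2000.0 | 0.0 | 2020 | 5 | 6 |
| 4 | 1 | 2020-05-07 |  |  | 2020 | 5 | 7 |
| 5 | 2 | 2020-05-01 |  |  | 2020 | 5 | 1 |
| 6 | 2 | 2020-05-07 | 1500.0 | 0.0 | 2020 | 5 | 7 |
| 7 | 2 | 2020-05-08 |  |  | 2020 | 5 | 8 |
| 8 | 2 | 2020-05-13 | 2500.0 | 0.0 | 2020 | 5 | 13 |
| 9 | 2 | 2020-05-14 |  |  | 2020 | 5 | 14 |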

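The `year`, `month` and `day` columns in the table above carry the same information as `timestamp`, split into parts that are convenient for partitioning. The short sketch below derives them for a few of the dates shown. It mirrors what the columns contain and makes no claim about how `HistoricalFeatureStoreWriter` computes them internally.

```python
from datetime import date

# A few `timestamp` values from the historical table above.
timestamps = [date(2020, 5, 1), date(2020, 5, 6), date(2020, 5, 13)]

# Derive the partition columns from each timestamp.
partitions = [
    {"timestamp": ts, "year": ts.year, "month": ts.month, "day": ts.day}
    for ts in timestamps
]

for row in partitions:
    print(row["timestamp"].isoformat(), row["year"], row["month"], row["day"])
```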

In [18]:
spark.table("online_feature_store__house_listings").orderBy("id", "timestamp").toPandas()

|   | id | timestamp | rent__avg_over_1_day_rolling_windows | rent__stddev_pop_over_1_day_rolling_windows |
|---|---|---|---|---|
| 0 | 1 | 2020-05-07 |  |  |
| 1 | 2 | 2020-05-14 |  |  |
| 2 | 3 | 2020-05-22 |  |  |
| 3 | 4 | 2020-05-22 |  |  |
| 4 | 5 | 2020-05-23 |  |  |
| 5 | 6 | 2020-05-24 |  |  |

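The online table above has one row per `id`, and for `id` 1 and `id` 2 the timestamps equal the most recent ones in the historical table. A minimal sketch of that "keep the latest row per key" idea, in plain Python and independent of how `OnlineFeatureStoreWriter` actually implements it, could look like this:

```python
# (id, timestamp) rows for ids 1 and 2, copied from the historical table above.
historical_rows = [
    (1, "2020-05-01"), (1, "2020-05-02"), (1, "2020-05-03"),
    (1, "2020-05-06"), (1, "2020-05-07"),
    (2, "2020-05-01"), (2, "2020-05-07"), (2, "2020-05-08"),
    (2, "2020-05-13"), (2, "2020-05-14"),
]

# Keep only the most recent timestamp per id.
# ISO-formatted dates compare correctly as plain strings.
latest = {}
for listing_id, ts in historical_rows:
    if listing_id not in latest or ts > latest[listing_id]:
        latest[listing_id] = ts

print(latest)  # {1: '2020-05-07', 2: '2020-05-14'}
```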

- We were able to create all the desired features in an easy way;
- The **historical feature set** holds all the data, and we can see that it is partitioned by year, month and day (columns added by the `HistoricalFeatureStoreWriter`);
- In the **online feature set** there is only the latest data for each id.