![](https://66.media.tumblr.com/tumblr_macyx8VqU11rfjowdo1_500.gif)


# #1 Discovering Butterfree - Feature Set Basics

Welcome to **Discovering Butterfree** tutorial series!

This first tutorial will cover some basics of Butterfree library and you learn how to create your first feature set :rocket: :rocket:

Before diving into the tutorial make sure you have a basic understanding of these main data concepts: **features**, **feature sets** and the **"Feature Store Architecture"**, you can read more about this [here]().

## Library Basics:

Buterfree's main objective is to make feature engineering easy. The library provides a high-level API for declarative feature definitions. But behind these abstractions, Butterfree is essentially an **ETL (Extract - Transform - Load)** framework, so this reflects in terms of the organization of the project.

### Extract

`from butterfree.core.extract import ...`

Module with the entities responsible for extracting data into the pipeline. The module provides the following tools:

* `readers`: data connectors. Currently Butterfree provides readers for files, tables registered in Spark Hive metastore, and Kafka topics.


* `pre_processing`: a utility tool for making some transformations or re-arrange the structure of the reader's input data before the feature engineering.


* `source`: a composition of `readers`. The entity responsible for merging datasets coming from the defined readers into a single dataframe input for the `Transform` stage.

### Transform

`from butterfree.core.transform import ...`

The main module of the library, responsible for feature engineering, in other words, all the transformations on the data. The module provides the following main tools:

* `features`: the entity that defines what a feature is. Holds a transformation and metadata about the feature.


* `transformations`: provides a set of components for transforming the data, with the possibility to use Spark native functions, aggregations, SQL expressions and others. 


* `feature_set`: an entity that defines a feature set. Holds features and the metadata around it.


### Load

`from butterfree.core.load import ...`

The module is responsible for saving the data in some data storage. The module provides the following tools:

* `writers`: provide connections to data sources to write data. Currently Butterfree provides ways to save data on S3 registered as tables Spark Hive metastore and to Cassandra DB.


* `sink`: a composition of writers. The entity responsible for triggering the writing jobs on a set of defined writers

### Pipelines

Pipelines are responsible for integrating all other modules (`extract`, `transform`, `load`) in order to define complete ETL jobs from source data to data storage destination.

`from butterfree.core.pipelines import ...`

* `feature_set_pipeline`: defines an ETL pipeline for creating feature sets.




## Example:
Simulating the following scenario:

- We want to create a feature set with features about houses for rent (listings).

- We are interested in houses only for the **Kanto** region.

We have two sets of data:

- Table: `listing_events`. Table with data about events of house listings.
- File: `region.json`. Static file with data about the cities and regions.

Our desire is to have result dataset with the following schema:

| id | timestamp | rent | rent_over_area | bedrooms | bathrooms | area | bedrooms_over_area | bathrooms_over_area | latitude | longitude | h3 | city | region 
| - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| int | timestamp | float | float | int | int | float | float | float | double | double | string | string | string |

For more information about H3 geohash click [here]()

The following code blocks will show how to generate this feature set using Butterfree library:



In [1]:
# setup spark
from pyspark import SparkContext, SparkConf
from pyspark.sql import session

conf = SparkConf().set('spark.driver.host','127.0.0.1')
sc = SparkContext(conf=conf)
spark = session.SparkSession(sc)

In [2]:
# fix working dir
import pathlib
import os
path = os.path.join(pathlib.Path().absolute(), '../..')
os.chdir(path)

### Showing test data

In [3]:
listing_evengs_df = spark.read.json("listing_events.json")
listing_evengs_df.write.mode("overwrite").saveAsTable("listing_events")  # creating listing_events table

print(">>> listing_events table:")
listing_evengs_df.show()

print(">>> region.json file:")
spark.read.json("region.json").show()

>>> listing_events table:
+----+---------+--------+---+---------+----+-------------+
|area|bathrooms|bedrooms| id|region_id|rent|    timestamp|
+----+---------+--------+---+---------+----+-------------+
|  50|        1|       1|  1|        1|1300|1588302000000|
|  50|        1|       1|  1|        1|2000|1588647600000|
| 100|        1|       2|  2|        2|1500|1588734000000|
| 100|        1|       2|  2|        2|2500|1589252400000|
| 150|        2|       2|  3|        3|3000|1589943600000|
| 175|        2|       2|  4|        4|3200|1589943600000|
| 250|        3|       3|  5|        5|3200|1590030000000|
| 225|        3|       2|  6|        6|3200|1590116400000|
+----+---------+--------+---+---------+----+-------------+

>>> region.json file:
+--------+---+---------+----------+------+
|    city| id|      lat|       lng|region|
+--------+---+---------+----------+------+
|Cerulean|  1| 73.44489|   31.7503| Kanto|
|Veridian|  2|  -9.4351|-167.11772| Kanto|
|Cinnabar|  3| 29.73043| 117

### Extract

- For the extract part, we need the `Source` entity and the `FileReader` and `TableReader` for the data we have.
- We need to declare a query with the rule for joining the results of the readers too.
- As proposed in the problem we can filter the region dataset to get only **Kanto** region.


In [4]:
from butterfree.core.extract import Source
from butterfree.core.extract.readers import FileReader, TableReader
from butterfree.core.extract.pre_processing import filter

readers = [
    TableReader(id="listing_events", database="default", table="listing_events",),
    FileReader(id="region", path="region.json", format="json",).with_(
        transformer=filter, condition="region == 'Kanto'"
    ),
]

query = """
select
    listing_events.*,
    region.city,
    region.region,
    region.lat,
    region.lng,
    region.region as region_name
from
    listing_events
    join region
      on listing_events.region_id = region.id
"""

source = Source(readers=readers, query=query)

### Transform
- At the transform part, a set of `Feature` objects is declared.
- An Instance of `FeatureSet` is used to hold the features.
- A `FeatureSet` can only be created when it is possible to define a unique tuple formed by key columns and a time reference. This is an **architectural requirement** for the data. So least one `KeyFeature` and one `TimestampFeature` is needed.
- Every `Feature` needs a unique name, a description, and a data-type definition.

In [5]:
from butterfree.core.transform import FeatureSet
from butterfree.core.transform.features import Feature, KeyFeature, TimestampFeature
from butterfree.core.transform.transformations import SQLExpressionTransform
from butterfree.core.transform.transformations.h3_transform import H3HashTransform
from butterfree.core.constants.data_type import DataType

keys = [
    KeyFeature(
        name="id",
        description="Unique identificator code for houses.",
        dtype=DataType.BIGINT,
    )
]

# from_ms = True because the data originally is not in a Timestamp format.
ts_feature = TimestampFeature(from_column="timestamp", from_ms=True)

features = [
    Feature(
        name="rent",
        description="Rent value by month described in the listing.",
        dtype=DataType.FLOAT,
    ),
    Feature(
        name="rent_over_area",
        description="Rent value by month divided by the area of the house.",
        transformation=SQLExpressionTransform("rent / area"),
        dtype=DataType.FLOAT,
    ),
    Feature(
        name="bedrooms",
        description="Number of bedrooms of the house.",
        dtype=DataType.INTEGER,
    ),
    Feature(
        name="bathrooms",
        description="Number of bathrooms of the house.",
        dtype=DataType.INTEGER,
    ),
    Feature(
        name="area",
        description="Area of the house, in squared meters.",
        dtype=DataType.STRING,
    ),
    Feature(
        name="bedrooms_over_area",
        description="Number of bedrooms divided by the area.",
        transformation=SQLExpressionTransform("bedrooms / area"),
        dtype=DataType.STRING,
    ),
    Feature(
        name="bathrooms_over_area",
        description="Number of bathrooms divided by the area.",
        transformation=SQLExpressionTransform("bathrooms / area"),
        dtype=DataType.STRING,
    ),
    Feature(
        name="latitude",
        description="House location latitude.",
        from_column="lat",  # arg from_column is needed when changing column name
        dtype=DataType.DOUBLE,
    ),
    Feature(
        name="longitude",
        description="House location longitude.",
        from_column="lng",
        dtype=DataType.DOUBLE,
    ),
    Feature(
        name="h3",
        description="H3 hash geohash.",
        transformation=H3HashTransform(
            h3_resolutions=[10], lat_column="latitude", lng_column="longitude",
        ),
        dtype=DataType.STRING,
    ),
    Feature(name="city", description="House location city.", dtype=DataType.STRING,),
    Feature(
        name="region",
        description="House location region.",
        from_column="region_name",
        dtype=DataType.STRING,
    ),
]

feature_set = FeatureSet(
    name="house_listings",
    entity="house",  # entity: to which "business context" this feature set belongs
    description="Features describring a house listing.",
    keys=keys,
    timestamp=ts_feature,
    features=features,
)

### Load

- For the load part we need `Writer` instances and a `Sink`.
- writers define where to load the data.
- The `Sink` gets the transformed data (feature set) and trigger the load to all the defined writers.
- `debug_mode` will create a temporary view instead of trying to write in a real data store.

In [6]:
from butterfree.core.load.writers import (
    HistoricalFeatureStoreWriter,
    OnlineFeatureStoreWriter,
)
from butterfree.core.load import Sink

writers = [HistoricalFeatureStoreWriter(debug_mode=True), OnlineFeatureStoreWriter(debug_mode=True)]
sink = Sink(writers=writers)

## Pipeline

- The `Pipeline` entity wraps all the other defined elements.
- `run` command will trigger the execution of the pipeline, end-to-end.

In [7]:
from butterfree.core.pipelines import FeatureSetPipeline

pipeline = FeatureSetPipeline(source=source, feature_set=feature_set, sink=sink)

In [8]:
result_df = pipeline.run()

  f"The column name {self.name} "
  f"The column name {self.name} "


### Showing the results

In [9]:
# fixing wrap output
from IPython.core.display import display, HTML
display(HTML("<style>div.output_area pre {white-space: pre;}</style>"))

print(">>> Historical Feature house_listings feature set table:")
spark.table("historical_feature_store__house_listings").orderBy(
    "id", "timestamp"
).show()

print(">>> Online Feature house_listings feature set table:")
spark.table("online_feature_store__house_listings").orderBy("id", "timestamp").show()

>>> Historical Feature house_listings feature set table:
+---+-------------------+------+------------------+--------+---------+----+--------------------+--------------------+---------+----------+--------------------+--------+------+----+-----+---+
| id|          timestamp|  rent|    rent_over_area|bedrooms|bathrooms|area|  bedrooms_over_area| bathrooms_over_area| latitude| longitude|lat_lng__h3_hash__10|    city|region|year|month|day|
+---+-------------------+------+------------------+--------+---------+----+--------------------+--------------------+---------+----------+--------------------+--------+------+----+-----+---+
|  1|2020-05-01 00:00:00|1300.0|              26.0|       1|        1|  50|                0.02|                0.02| 73.44489|   31.7503|     8a011c942b5ffff|Cerulean| Kanto|2020|    5|  1|
|  1|2020-05-05 00:00:00|2000.0|              40.0|       1|        1|  50|                0.02|                0.02| 73.44489|   31.7503|     8a011c942b5ffff|Cerulean| Kanto|2020

- We can see that we were able to create all the desired features in an easy way
- The **historical feature set** holds all the data, and we can see that it is partitioned by year, month and day (columns added in the `HistoricalFeatureStoreWriter`)
- In the **online feature set** there is only the latest data for each id