# (2) Automated Feature Engineering

In this second notebook, we will apply automated **feature engineering** via `featuretools`. It is worthwhile to mention that `h2o` contains three algorithms that are somewhat related to what we are doing here: [Aggregator](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/aggregator.html), [Principal Component Analysis (PCA)](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/pca.html), and [Generalized Low Rank Models (GLRM)](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glrm.html). While the functionality of `featuretools` surpasses that of the **Aggregator** algorithm of `h2o`, **PCA** and **GLRM** could be explored further for dimensionality reduction. This is especially relevant for the problem at hand, as we will have generated over a hundred features by the end of this notebook, and training models on all of them will take a long time.

In [1]:
import pandas as pd
# !pip install featuretools
# !pip install graphviz
import featuretools as ft
from utils import unknown_to_nan

## Reading Data

### Important Notes
* We don't want to use the parameter `@index_col` of the `pandas.read_csv()` method. To establish **one-to-many** relationships between variables in different data frames with `featuretools`, the key variables must be available as regular columns rather than as indices. Specifically, we are referring to the `entity_from_dataframe()` method of `featuretools.EntitySet()`, which accepts a parameter `@index`. We cover these in the EntitySet setup section below.
* We will also not be utilizing `countries.csv` in this example as countries referred to here are the labels (to-be predicted values) for the problem at hand.
* Although the countries in this file and the country destinations in `train_users.csv` constitute a **one-to-many** relationship, we will remove the 'country_destination' column from the training data frame, as we don't want `featuretools` to synthesize any new features from it through transformations or aggregations. The `dfs()` method, as we shall see later in this notebook, also accepts the argument `@ignore_variables`, which we could have used instead, but dropping the column is probably the safer approach.

In [2]:
date_variables = ['date_account_created', 'timestamp_first_active']
users_df = pd.read_csv('(1)data_manual_ops/train_users.csv', parse_dates=date_variables)
users_test_df = pd.read_csv('(1)data_manual_ops/test_users.csv', parse_dates=date_variables)
buckets_df = pd.read_csv('(1)data_manual_ops/age_gender_bkts.csv')
sessions_df = pd.read_csv('(0)data/sessions.csv')

# Sort data frames by user ID to match the feature matrix featuretools will yield
users_df.sort_values('id', inplace=True)
users_test_df.sort_values('id', inplace=True)

# Remove target variable from training set
targets = users_df['country_destination'].values
users_df.drop('country_destination', axis=1, inplace=True)
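
The sort before extracting `targets` matters because the labels are reattached by position at the end of the notebook. The sketch below illustrates this on made-up toy data, assuming (as the comment in the cell above states) that the feature matrix comes back ordered by user ID. All `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy stand-in for train_users.csv; ids are deliberately out of order
toy_users = pd.DataFrame({
    'id': ['c3', 'a1', 'b2'],
    'age': [41, 25, 33],
    'country_destination': ['FR', 'US', 'NDF'],
})

# Sort first, then pull the labels out in that same order
toy_users = toy_users.sort_values('id').reset_index(drop=True)
toy_targets = toy_users['country_destination'].values
toy_users = toy_users.drop('country_destination', axis=1)

# Stand-in for a feature matrix indexed (and ordered) by user id
toy_matrix = toy_users.set_index('id').sort_index()
toy_matrix['country_destination'] = toy_targets  # assigned by position, not by id
print(toy_matrix)
```

Had the labels been extracted before sorting, the positional assignment in the last step would silently pair users with the wrong destinations.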

## Feature Extraction & Synthesis with `featuretools`

### Entities & EntitySet Setup
An `EntitySet` is a collection of entities and the relationships between them. They are useful for preparing raw, structured datasets for feature engineering. Here, each `Entity` can be roughly thought of as an independent data frame and the `EntitySet` wraps these data frames together via **one-to-many** relationships. 

After creating an `Entity` within an `EntitySet` as shown below, you can use `print(<EntitySet>[<EntityID>].variable_types)` to see the mapping of string variable IDs (column names of the data frame) to the corresponding `ft.variable_types.variable` and check whether there is any discrepancy or anything that could be improved. Some of the main variable types are: *Index*, *TimeIndex*, *Datetime*, *Numeric*, *Categorical*, *Ordinal*, *Boolean*, *Text*, *LatLong*, *ZIPCode*, *IPAddress*, *FullName*, *EmailAddress*, *URL*, *PhoneNumber*, *DateOfBirth*, and *CountryCode*. Correctly mapping variables to variable types enhances the utility of primitives, as we will discuss soon. For example, in our problem, the 'signup_flow' column will be converted to a categorical variable, whereas `pandas` initially read it as a numerical variable.

If a data frame that we are adding to the entity set as a new entity doesn't already have a unique ID column, we can create a new index by passing an arbitrary column name to the `@index` parameter, and suppress the resulting warning by enabling the `@make_index` parameter.
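
Conceptually, creating an index this way amounts to adding a running integer column. The pandas sketch below illustrates the idea on made-up data; it is an analogy under that assumption, not the `featuretools` implementation, and the `toy_sessions` frame is hypothetical.

```python
import pandas as pd

# Toy sessions table with no unique key (user ids repeat)
toy_sessions = pd.DataFrame({
    'user_id': ['a1', 'a1', 'b2'],
    'secs_elapsed': [120.0, 4000.0, 15.0],
})

# Work on a copy so the original frame is left untouched
indexed = toy_sessions.copy()
indexed.insert(0, 'data_id', range(len(indexed)))  # running integers 0, 1, 2, ...

print(indexed['data_id'].is_unique)
```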

In this example we will create separate `EntitySet`s for training and testing users. This is the recommended solution as outlined by Max Kanter (creator of `featuretools`) [here](https://stackoverflow.com/questions/49711987/how-do-i-prevent-data-leakage-with-featuretools). Although other methods could also potentially work, my experience was that `featuretools` automatically sorts data frame rows during **DFS (Deep Feature Synthesis)**, and extracting the training and testing data frames afterwards through user IDs was a very messy method.

In [3]:
# Initialize the main training entity set
es = ft.EntitySet(id='bookings')
# Add training users to training entity set as a new entity
es = es.entity_from_dataframe(entity_id='users',
                              dataframe=users_df,
                              index='id',
                              variable_types={'signup_flow': ft.variable_types.Categorical})
# Add sessions to training entity set as a new entity
es = es.entity_from_dataframe(entity_id='sessions',
                              dataframe=sessions_df,
                              index='data_id',
                              make_index=True)
# Add age & gender buckets to training entity set as a new entity
es = es.entity_from_dataframe(entity_id='buckets',
                              dataframe=buckets_df,
                              index='bucket_id')
# es.plot()  # display training entity set
print(es)
print(es['users'].variable_types)

# Initialize the main testing entity set
es_test = ft.EntitySet(id='bookings_test')
# Add testing users to testing entity set as a new entity
es_test = es_test.entity_from_dataframe(entity_id='users',
                                        dataframe=users_test_df,
                                        index='id',
                                        variable_types={'signup_flow': ft.variable_types.Categorical})
# Add sessions to testing entity set as a new entity
es_test = es_test.entity_from_dataframe(entity_id='sessions',
                                        dataframe=sessions_df,
                                        index='data_id')
# Add age & gender buckets to testing entity set as a new entity
es_test = es_test.entity_from_dataframe(entity_id='buckets',
                                        dataframe=buckets_df,
                                        index='bucket_id')
# es_test.plot()  # display testing entity set
print(es_test)
print(es_test['users'].variable_types)

Entityset: bookings
  Entities:
    users [Rows: 212593, Columns: 15]
    sessions [Rows: 10567737, Columns: 7]
    buckets [Rows: 42, Columns: 11]
  Relationships:
    No relationships
{'id': <class 'featuretools.variable_types.variable.Index'>, 'date_account_created': <class 'featuretools.variable_types.variable.Datetime'>, 'timestamp_first_active': <class 'featuretools.variable_types.variable.Datetime'>, 'gender': <class 'featuretools.variable_types.variable.Categorical'>, 'age': <class 'featuretools.variable_types.variable.Numeric'>, 'signup_method': <class 'featuretools.variable_types.variable.Categorical'>, 'language': <class 'featuretools.variable_types.variable.Categorical'>, 'affiliate_channel': <class 'featuretools.variable_types.variable.Categorical'>, 'affiliate_provider': <class 'featuretools.variable_types.variable.Categorical'>, 'first_affiliate_tracked': <class 'featuretools.variable_types.variable.Categorical'>, 'signup_app': <class 'featuretools.variable_types.variabl

### Relationships
In our problem, each user (and associated ID) has multiple session records. Here, the users entity is called the **parent entity**, and the sessions entity is known as the **child entity** in this relationship. Similarly, every age-gender bucket has multiple examples in the users entity. Hence, the buckets entity is the parent entity and the users entity is the child entity in this relationship. When specifying relationships, we list the variable in the parent entity first. Note that each `ft.Relationship` must denote a **one-to-many** relationship rather than one that is *one-to-one* or *many-to-many*.

Creating relationships between variables is pretty easy within `featuretools`, but it comes with two main caveats:
1. The only allowed relationships are **one-to-many**, which means that there will be scenarios, like the one we encountered in *Manual Feature Engineering*, where we have to reshape the data until such a relationship exists.
2. Relationships must be specified by hand. We can look for columns with unique values in our `pandas` data frames, since these are candidates for the *one* side of a **one-to-many** relationship, but nothing about this step is automated.
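
Since finding candidate keys is manual, a small helper can at least make the check explicit. The following is one possible sketch in plain pandas; `is_one_to_many` is a hypothetical helper (not part of `featuretools`), and the toy frames are made up.

```python
import pandas as pd

def is_one_to_many(parent, parent_key, child, child_key):
    """True if parent_key is unique and covers every non-null child_key value."""
    unique_in_parent = parent[parent_key].is_unique
    child_values = child[child_key].dropna()
    covered = child_values.isin(parent[parent_key]).all()
    return bool(unique_in_parent and covered)

toy_users = pd.DataFrame({'id': ['a1', 'b2', 'c3']})
toy_sessions = pd.DataFrame({'user_id': ['a1', 'a1', 'b2', None]})

print(is_one_to_many(toy_users, 'id', toy_sessions, 'user_id'))  # users as parent
print(is_one_to_many(toy_sessions, 'user_id', toy_users, 'id'))  # reversed direction
```

Only the first direction passes, because `user_id` repeats in the sessions table and therefore cannot serve as the *one* side.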

In [4]:
# Initialize one-to-many (in this exact order) relationships & add to training entity set
id_to_id = ft.Relationship(es['users']['id'], 
                           es['sessions']['user_id'])
bucket_to_bucket = ft.Relationship(es['buckets']['bucket_id'], 
                                   es['users']['age_gender_bucket'])
es.add_relationships([id_to_id, bucket_to_bucket])
# Initialize one-to-many (in this exact order) relationships & add to testing entity set
id_to_id_test = ft.Relationship(es_test['users']['id'], 
                                es_test['sessions']['user_id'])
bucket_to_bucket_test = ft.Relationship(es_test['buckets']['bucket_id'],
                                        es_test['users']['age_gender_bucket'])
es_test.add_relationships([id_to_id_test, bucket_to_bucket_test])
# Observe initialized relationships
print(es.relationships)
print(es_test.relationships)

[<Relationship: sessions.user_id -> users.id>, <Relationship: users.age_gender_bucket -> buckets.bucket_id>]
[<Relationship: sessions.user_id -> users.id>, <Relationship: users.age_gender_bucket -> buckets.bucket_id>]


### Feature Primitives

Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. There are two types of primitives supported by `featuretools`: i) **aggregation**, ii) **transform**. 

Aggregation primitives take related instances as an input and output a single value and they are applied across *multiple* entities which are described by a parent-child relationship in the entity set. Transform primitives each take one or more variables from a *single* entity as an input and output a new variable for that entity. Call `featuretools.list_primitives()` to see the full list of primitives, or check `featuretools_primitives.csv`. Some common aggregation primitives include `mean`, `median`, `count`, and `std`. Some common transformation primitives include `hour`, `day`, `month`, and `year`.
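
To make the distinction concrete, here is a rough pandas analogy on made-up data: an aggregation collapses many child rows into one value per parent, while a transform maps a column of a single table to a new column of the same length. This is only an illustration of the two ideas, not how `featuretools` computes them.

```python
import pandas as pd

toy_sessions = pd.DataFrame({
    'user_id': ['a1', 'a1', 'b2'],
    'secs_elapsed': [100.0, 300.0, 50.0],
})
toy_users = pd.DataFrame({
    'id': ['a1', 'b2'],
    'date_account_created': pd.to_datetime(['2014-01-01 09:30', '2014-02-15 22:05']),
})

# Aggregation-style: many child rows (sessions) -> one value per parent (user)
mean_secs = toy_sessions.groupby('user_id')['secs_elapsed'].mean()

# Transform-style: one column of a single table -> a new column of the same length
hour_created = toy_users['date_account_created'].dt.hour

print(mean_secs.to_dict())
print(hour_created.tolist())
```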

#### Defining Custom Primitives

We can also define our own primitives with `featuretools`. To define a primitive, we must
* specify the type of primitive: Aggregation or Transform
* define the input and output data types
* write a function in Python to do the calculation
* annotate it with attributes to constrain how it is applied

Let's implement two examples:
* Let's create a custom **transform** primitive, assuming that our data frames contain a text review (of the booking) from each user. Using the [NRC Emotion Lexicon](http://web.stanford.edu/class/cs124/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt), we can count the number of occurrences of words related to *anger* for each user with the function and calls below. (NOTE: Only a few specific words are used here for reference.) The `AngerWordsCount()` class can now be used like any other transform primitive.
* Let's implement a custom **aggregation** primitive, this time applying it to the actual session information we have for each user. Specifically, we will compute the number of sessions in which each user spent at least one hour, across all types of 'action_detail' (as a single feature) and for two particular types, viewing search results and updating wishlists (as two features). The `CountLongDreamingSessions()` class can now be used like any other aggregation primitive.

The second example also demonstrates the use of `interesting_values`, which lets us specify values to focus on for a variable in an `Entity`, and of the parameter `@where_primitives`, which applies the aggregation conditionally on those values. The parameter `@agg_primitives`, on the other hand, ensures that we also aggregate over all action detail types. Moreover, passing an empty list to `@agg_primitives` will **prevent** any kind of aggregation from being performed with the new primitive we defined. Finally, we are not *yet* calculating any actual feature matrix here, as we will leave feature engineering to `featuretools` completely; the examples below will therefore not be incorporated in the final data frame we pass to prediction.

In [5]:
import numpy as np
from featuretools.primitives import make_agg_primitive, make_trans_primitive
from featuretools.variable_types import Text, Numeric, Categorical

# Example (1): count anger-related words in each review (case-insensitive substring matches)
def anger_words_count(column, anger_words=('hate', 'furious', 'terrible', 'disgusting', 'shameful')):
    counts = [np.sum([review.lower().count(anger_word) for anger_word in anger_words]) for review in column]
    return counts

AngerWordsCount = make_trans_primitive(function=anger_words_count,
                                       input_types=[Text],
                                       return_type=Numeric)

# Example (2): count the sessions that lasted at least one hour (3600 seconds)
def count_long_dreaming_sessions(seconds_elapsed):
    df = pd.DataFrame({'seconds_elapsed': seconds_elapsed})
    return len(df[df['seconds_elapsed'] >= 3600.0])


CountLongDreamingSessions = make_agg_primitive(function=count_long_dreaming_sessions,
                                               input_types=[Numeric],
                                               return_type=Numeric)

es['sessions']['action_detail'].interesting_values = ['view_search_results', 'wishlist_content_update']

feature_definitions = ft.dfs(entityset=es,
                             target_entity="users",
                             agg_primitives=[CountLongDreamingSessions],
                             trans_primitives=[],
                             where_primitives=[CountLongDreamingSessions],
                             max_depth=1,
                             ignore_variables={'sessions': ['action', 'action_type', 'device_type']},
                             features_only=True)
print(feature_definitions)

[<Feature: gender>, <Feature: age>, <Feature: signup_method>, <Feature: language>, <Feature: affiliate_channel>, <Feature: affiliate_provider>, <Feature: first_affiliate_tracked>, <Feature: signup_app>, <Feature: first_device_type>, <Feature: first_browser>, <Feature: age_gender_bucket>, <Feature: signup_flow>, <Feature: COUNT_LONG_DREAMING_SESSIONS(sessions.secs_elapsed)>, <Feature: buckets.CA>, <Feature: buckets.DE>, <Feature: buckets.FR>, <Feature: buckets.GB>, <Feature: buckets.AU>, <Feature: buckets.NL>, <Feature: buckets.US>, <Feature: buckets.IT>, <Feature: buckets.PT>, <Feature: buckets.ES>, <Feature: COUNT_LONG_DREAMING_SESSIONS(sessions.secs_elapsed WHERE action_detail = wishlist_content_update)>, <Feature: COUNT_LONG_DREAMING_SESSIONS(sessions.secs_elapsed WHERE action_detail = view_search_results)>]
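
The `WHERE` features printed above can be read as the same aggregation restricted to rows with one interesting value. A rough pandas equivalent on made-up session rows is sketched below; the frame and the variable names are hypothetical, and the code only mirrors the idea rather than the `featuretools` computation.

```python
import pandas as pd

toy_sessions = pd.DataFrame({
    'user_id':       ['a1', 'a1', 'a1', 'b2'],
    'action_detail': ['view_search_results', 'view_search_results',
                      'wishlist_content_update', 'view_search_results'],
    'secs_elapsed':  [4000.0, 10.0, 3600.0, 7200.0],
})

# Keep only sessions of at least one hour
long_sessions = toy_sessions[toy_sessions['secs_elapsed'] >= 3600.0]

# Unconditional count per user (all action_detail values)
count_all = long_sessions.groupby('user_id').size()

# Count restricted to one interesting value, like a WHERE feature
is_search = long_sessions['action_detail'] == 'view_search_results'
count_search = long_sessions[is_search].groupby('user_id').size()

print(count_all.to_dict())
print(count_search.to_dict())
```

One difference worth keeping in mind: in this sketch, users without any matching rows are simply absent from the result rather than appearing with an explicit value.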


### Deep Feature Synthesis (DFS)

**Deep Feature Synthesis (DFS)** is at the heart of `featuretools`, and is an automated method for performing feature engineering on relational and temporal data. A data scientist would typically write code to aggregate data for a customer and apply different statistical functions, resulting in features that quantify the customer’s behavior; DFS aims to overcome the limits of a data scientist's time and imagination by creating many new features from multiple related tables of data. The idea, in essence, is to stack multiple feature primitives (both aggregations and transformations) to create new features. Then, these new features and the original features are combined in a single data frame which will be used for training predictive models. This idea is based on the paper *Deep Feature Synthesis: Towards Automating Data Science Endeavors* by James Max Kanter and Kalyan Veeramachaneni (Kanter is also the creator of `featuretools`).
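
As an illustration of what stacking means, the sketch below builds a depth-2 feature by hand in pandas on made-up data: sessions are first aggregated up to users, and that new user-level feature is then aggregated up to buckets. The bucket labels are invented, and the feature names only mimic the `featuretools` naming style; whether DFS emits this exact feature depends on the primitives and `max_depth` chosen.

```python
import pandas as pd

toy_users = pd.DataFrame({
    'id': ['a1', 'b2', 'c3'],
    'age_gender_bucket': ['25-29_female', '25-29_female', '30-34_male'],
})
toy_sessions = pd.DataFrame({
    'user_id': ['a1', 'a1', 'b2', 'c3', 'c3', 'c3'],
    'secs_elapsed': [100.0, 300.0, 50.0, 10.0, 20.0, 30.0],
})

# Depth 1: aggregate sessions up to users, in the spirit of COUNT(sessions)
session_counts = toy_sessions.groupby('user_id').size().rename('COUNT(sessions)')
users_d1 = toy_users.join(session_counts, on='id')

# Depth 2: aggregate the new user-level feature up to buckets,
# in the spirit of MEAN(users.COUNT(sessions)), then attach it to each user
bucket_mean = users_d1.groupby('age_gender_bucket')['COUNT(sessions)'].mean()
users_d2 = users_d1.join(bucket_mean.rename('buckets.MEAN(users.COUNT(sessions))'),
                         on='age_gender_bucket')
print(users_d2)
```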

Here, it may also be worthwhile to mention that a method called **Evolutionary Feature Synthesis (EFS)** also exists in the literature, but it is more focused on generating features from a single data frame rather than from multiple, relational ones. The paper *Building Predictive Models via Feature Synthesis* by Ignacio Arnaldo et al. describes the underlying algorithm in detail.

#### Implementation Notes
DFS can be performed with a simple call to `ft.dfs()`. It is a good idea to first pass `@features_only=True` to get only the feature definitions without the feature matrix (as we did in the section above). This skips the time-consuming calculations and lets us inspect and debug the synthesized features beforehand. For example, if a `TIME_SINCE_FIRST` (aggregation primitive) feature has not been generated for the entity set at hand even though it was requested, this most likely means that the entity being aggregated (here, sessions) contains no Datetime variable for it to use. This behavior makes `dfs()` robust: we can list a large number of primitives and try to generate the largest (hopefully most informative) feature matrix without getting runtime errors. Moreover, we should note that aggregation primitives such as `avg_time_between`, `time_since_first`, and `time_since_last` can come in handy in any problem where timestamped session records are kept for users.

In [6]:
# Specify aggregation primitives to be applied
aggregations = ['last', 'num_unique', 'skew', 'min', 'mean', 'count', 'std', 'max', 'median', 'mode']
# Specify transformation primitives to be applied
transformations = ['hour', 'day', 'week', 'month', 'year', 'is_weekend', 'cum_sum']
# Creating deep features and combining all primitive outputs & original features at a single training data frame
feature_matrix, feature_definitions = ft.dfs(entityset=es,
                                             target_entity='users',
                                             agg_primitives=aggregations,
                                             trans_primitives=transformations,
                                             ignore_variables={'sessions': ['data_id'], 'users':['id']},
                                             max_depth=2,
                                             verbose=True)
print(feature_definitions)

Built 138 features
Elapsed: 26:11 | Remaining: 00:00 | Progress: 100%|███████████████████████████████████████████████████████████████████| Calculated: 11/11 chunks
[<Feature: gender>, <Feature: age>, <Feature: signup_method>, <Feature: language>, <Feature: affiliate_channel>, <Feature: affiliate_provider>, <Feature: first_affiliate_tracked>, <Feature: signup_app>, <Feature: first_device_type>, <Feature: first_browser>, <Feature: age_gender_bucket>, <Feature: signup_flow>, <Feature: LAST(sessions.action)>, <Feature: LAST(sessions.action_type)>, <Feature: LAST(sessions.action_detail)>, <Feature: LAST(sessions.device_type)>, <Feature: LAST(sessions.secs_elapsed)>, <Feature: NUM_UNIQUE(sessions.action)>, <Feature: NUM_UNIQUE(sessions.action_type)>, <Feature: NUM_UNIQUE(sessions.action_detail)>, <Feature: NUM_UNIQUE(sessions.device_type)>, <Feature: SKEW(sessions.secs_elapsed)>, <Feature: MIN(sessions.secs_elapsed)>, <Feature: MEAN(sessions.secs_elapsed)>, <Feature: STD(sessions.secs_elapse

## Apply Equivalent Feature Extractions for Test Set
The `calculate_feature_matrix()` method applies the same synthesized feature definitions to a different `EntitySet`, which in our case is the separate test set. The `encode_features()` method allows us to easily one-hot encode the categorical variables. However, we will deal with data encoding in the next notebook.
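
The reason the encodings learned on the training set are reused for the test set is that both matrices must end up with identical columns. The pandas sketch below shows the same concern with `pd.get_dummies` on made-up data; it is a stand-in for the idea, not a replacement for `encode_features()`.

```python
import pandas as pd

toy_train = pd.DataFrame({'gender': ['female', 'male', 'other'], 'age': [25, 33, 41]})
toy_test = pd.DataFrame({'gender': ['male', 'female'], 'age': [29, 52]})

train_encoded = pd.get_dummies(toy_train, columns=['gender'])

# Encode the test set, then force it onto the training columns;
# categories that never occur in the test set become all-zero columns
test_encoded = (pd.get_dummies(toy_test, columns=['gender'])
                .reindex(columns=train_encoded.columns, fill_value=0))

print(list(test_encoded.columns))
```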

In [7]:
# Keep the extracted variables in their raw form (without any encoding) and do the same for test set
feature_matrix_test = ft.calculate_feature_matrix(features=feature_definitions, entityset=es_test, verbose=True)

# One-hot encode categorical variables and do the same for test set
#feature_matrix_encoded, feature_encodings = ft.encode_features(feature_matrix, feature_definitions, verbose=True)
#feature_matrix_encoded_test = ft.calculate_feature_matrix(features=feature_encodings, entityset=es_test, verbose=True)

Elapsed: 25:12 | Remaining: 00:00 | Progress: 100%|███████████████████████████████████████████████████████████████████| Calculated: 11/11 chunks


## Converting Unknowns to NaNs
Although we spent some time imputing missing data values in the introductory notebook, we won't be imputing the missing values originating from the feature extractions we performed above. For example, users with no session information whatsoever will have NaN in several columns. Imputing these would be not only inefficient but also illogical, as the missing values tell a very specific story here.

In [8]:
feature_matrix = unknown_to_nan(df=feature_matrix)
feature_matrix_test = unknown_to_nan(df=feature_matrix_test)
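
The `unknown_to_nan` helper is imported from `utils` and its body is not shown in this notebook. Purely as an illustration, a hypothetical version could look like the sketch below, assuming that unknowns are encoded as placeholder strings such as `'-unknown-'`; both the function name `unknown_to_nan_sketch` and the token list are assumptions, not the actual helper.

```python
import numpy as np
import pandas as pd

def unknown_to_nan_sketch(df, unknown_tokens=('-unknown-', 'unknown')):
    """Hypothetical stand-in: replace placeholder strings with NaN in a new frame."""
    return df.replace(list(unknown_tokens), np.nan)

toy = pd.DataFrame({'gender': ['female', '-unknown-'],
                    'first_browser': ['Chrome', 'unknown']})
cleaned = unknown_to_nan_sketch(toy)
print(cleaned.isna().sum().to_dict())
```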

## Save Progress
With this second notebook, our saved data has gone through automated feature engineering based on DFS.

In [9]:
# Add back the response variable to the training set (relies on the earlier sort by user ID) & save progress
feature_matrix['country_destination'] = targets
# index=False leaves the user ID index out of the saved files
feature_matrix.to_csv('(2)data_automated_ops/train_users.csv', index=False)
feature_matrix_test.to_csv('(2)data_automated_ops/test_users.csv', index=False)