# Deep Feature Synthesis
Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.

## Input data

In [1]:
import featuretools as ft

In [2]:
es = ft.demo.load_mock_customer(return_entityset=True)
es

Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 3]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id
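
Each relationship in the summary above can be read as a link from a child entity to its parent. As a rough illustration only (plain Python, not the featuretools API), the sketch below copies those three links into a list and follows them upward. The `relationships` list and the `ancestors` helper are hypothetical names introduced for this example.

```python
# Child-to-parent links copied from the EntitySet summary above:
# (child entity, child column, parent entity, parent column)
relationships = [
    ("transactions", "product_id", "products", "product_id"),
    ("transactions", "session_id", "sessions", "session_id"),
    ("sessions", "customer_id", "customers", "customer_id"),
]


def ancestors(entity):
    """Return every entity reachable from `entity` by following child-to-parent links."""
    found = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for child, _, parent, _ in relationships:
            if child == current and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("transactions")))  # ['customers', 'products', 'sessions']
print(sorted(ancestors("customers")))     # []
```

Transactions sit at the bottom of the chain and customers at the top, which is why features for a customer can draw on both sessions and transactions.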

## Running DFS

Typically, without automated feature engineering, a data scientist would write code to aggregate data for a customer and apply different statistical functions, resulting in features that quantify the customer’s behavior. In this example, an expert might be interested in features such as the *total number of sessions* or the *month the customer signed up*.

DFS can generate these features when we specify `target_entity` as **customers** and pass **"count"** and **"month"** as primitives.

In [6]:
feature_matrix, feature_defs = ft.dfs(entityset = es, 
                                     target_entity = 'customers',
                                     agg_primitives = ['count'],
                                     trans_primitives = ['month'],
                                     max_depth = 1)

In [7]:
feature_matrix

| customer_id | zip_code | COUNT(sessions) | MONTH(join_date) |
|---|---|---|---|
| 1 | 60091 | 10 | 1 |
| 2 | 2139 | 8 | 2 |
| 3 | 2139 | 5 | 4 |
| 4 | 60091 | 8 | 5 |
| 5 | 2139 | 4 | 7 |


The two primitives used here are of different types:

- **aggregation primitive**: "count", because it computes a single value from the many sessions related to one customer.
- **transform primitive**: "month", because it takes one value for a customer and transforms it into another.
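
To make the distinction concrete, here is a minimal plain-Python sketch of what the two kinds of primitive compute. It does not use featuretools, and the toy rows and the names `sessions`, `customers`, `count_sessions`, and `month_joined` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical toy data: each session row points at its parent customer.
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 2},
]
customers = [
    {"customer_id": 1, "join_date": datetime(2011, 4, 17, 10, 48)},
    {"customer_id": 2, "join_date": datetime(2012, 7, 8, 7, 22)},
]

# Aggregation: many child rows (sessions) collapse to one value per customer.
count_sessions = Counter(row["customer_id"] for row in sessions)

# Transform: one value on the customer row maps to one new value on that row.
month_joined = {c["customer_id"]: c["join_date"].month for c in customers}

for c in customers:
    cid = c["customer_id"]
    print(cid, count_sessions[cid], month_joined[cid])
```

The aggregation needs the child table to produce its value, while the transform only ever looks at the row it is applied to.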

## Creating **"Deep Features"**

The name Deep Feature Synthesis comes from the algorithm’s ability to stack primitives to generate more complex features. Each time we stack a primitive, we increase the “depth” of a feature. The `max_depth` parameter controls the maximum depth of the features returned by DFS. Let us try running DFS with `max_depth=2`:

In [9]:
feature_matrix, feature_defs = ft.dfs(entityset = es, 
                                      target_entity = 'customers', 
                                      agg_primitives = ['mean', 'sum', 'mode'], 
                                      trans_primitives = ['month', 'hour'], 
                                      max_depth = 2)

In [10]:
feature_matrix

| customer_id | zip_code | MODE(sessions.device) | MEAN(transactions.amount) | SUM(transactions.amount) | MODE(transactions.product_id) | MONTH(join_date) | HOUR(join_date) | MEAN(sessions.MEAN(transactions.amount)) | MEAN(sessions.SUM(transactions.amount)) | SUM(sessions.MEAN(transactions.amount)) | MODE(sessions.MODE(transactions.product_id)) | MODE(sessions.MONTH(session_start)) | MODE(sessions.HOUR(session_start)) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60091 | desktop | 78.143282 | 10236.77 | 3 | 1 | 0 | 79.197651 | 1023.677 | 791.976505 | 1 | 1 | 0 |
| 2 | 2139 | mobile | 74.744344 | 9118.81 | 1 | 2 | 0 | 74.530438 | 1139.85125 | 596.243506 | 1 | 1 | 1 |
| 3 | 2139 | desktop | 73.82359 | 5758.24 | 5 | 4 | 0 | 73.954024 | 1151.648 | 369.770121 | 3 | 1 | 8 |
| 4 | 60091 | desktop | 73.921441 | 8205.28 | 4 | 5 | 0 | 73.084141 | 1025.66 | 584.673126 | 1 | 1 | 3 |
| 5 | 2139 | tablet | 78.816724 | 4571.37 | 2 | 7 | 0 | 78.362236 | 1142.8425 | 313.448942 | 2 | 1 | 0 |


In [11]:
feature_defs

[<Feature: zip_code>,
 <Feature: MODE(sessions.device)>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: SUM(transactions.amount)>,
 <Feature: MODE(transactions.product_id)>,
 <Feature: MONTH(join_date)>,
 <Feature: HOUR(join_date)>,
 <Feature: MEAN(sessions.MEAN(transactions.amount))>,
 <Feature: MEAN(sessions.SUM(transactions.amount))>,
 <Feature: SUM(sessions.MEAN(transactions.amount))>,
 <Feature: MODE(sessions.MODE(transactions.product_id))>,
 <Feature: MODE(sessions.MONTH(session_start))>,
 <Feature: MODE(sessions.HOUR(session_start))>]

With a depth of 2, a number of features are generated using the supplied primitives. The algorithm that synthesizes these definitions is described in this [paper](http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf). To understand the returned feature matrix, let us look at one of the depth 2 features:

In [12]:
feature_matrix[['MEAN(sessions.SUM(transactions.amount))']]

| customer_id | MEAN(sessions.SUM(transactions.amount)) |
|---|---|
| 1 | 1023.677 |
| 2 | 1139.85125 |
| 3 | 1151.648 |
| 4 | 1025.66 |
| 5 | 1142.8425 |


For each customer, this feature

1. calculates the **sum** of all transaction amounts per session to get the total amount per session, then
2. applies the **mean** to those totals across the customer’s sessions to get the average amount spent per session.

We call this feature a “deep feature” with a depth of 2.
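
The two stacked steps can be sketched in plain Python as follows. This is an illustration of the arithmetic only, not how featuretools computes the feature internally. The toy `transactions` rows and all variable names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (customer_id, session_id, amount) per transaction.
transactions = [
    (1, 10, 20.0),
    (1, 10, 30.0),
    (1, 11, 10.0),
    (2, 12, 5.0),
    (2, 12, 15.0),
]

# Depth 1: SUM(transactions.amount) for each session.
session_totals = defaultdict(float)
session_owner = {}
for customer_id, session_id, amount in transactions:
    session_totals[session_id] += amount
    session_owner[session_id] = customer_id

# Depth 2: MEAN over each customer's session totals.
totals_by_customer = defaultdict(list)
for session_id, total in session_totals.items():
    totals_by_customer[session_owner[session_id]].append(total)

mean_session_total = {
    cid: mean(totals) for cid, totals in totals_by_customer.items()
}
print(mean_session_total)
```

In the toy data, customer 1 has session totals of 50.0 and 10.0, so the feature value is 30.0. The real output above can be checked the same way: customer 1 has 10 sessions and a total of 10236.77, and 10236.77 / 10 = 1023.677.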

Let’s look at another depth 2 feature, which calculates for every customer *the most common hour of the day when they start a session*:

In [13]:
feature_matrix[['MODE(sessions.HOUR(session_start))']]

| customer_id | MODE(sessions.HOUR(session_start)) |
|---|---|
| 1 | 0 |
| 2 | 1 |
| 3 | 8 |
| 4 | 3 |
| 5 | 0 |


For each customer, this feature

1. calculates the **hour** of the day at which each of their sessions started, then
2. applies the **mode** to identify the most common hour at which they started a session.
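
Here a transform primitive is stacked under an aggregation primitive. A minimal plain-Python sketch of that combination is shown below. The toy `sessions` rows and variable names are hypothetical, and the tie-breaking rule (the first value seen wins) is a property of `Counter.most_common`, not necessarily of featuretools.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical toy data: (customer_id, session_start) per session.
sessions = [
    (1, datetime(2014, 1, 1, 0, 0)),
    (1, datetime(2014, 1, 1, 0, 45)),
    (1, datetime(2014, 1, 1, 3, 10)),
    (2, datetime(2014, 1, 1, 8, 5)),
]

# Depth 1 (transform): HOUR(session_start) for each session.
hours_by_customer = defaultdict(list)
for customer_id, session_start in sessions:
    hours_by_customer[customer_id].append(session_start.hour)

# Depth 2 (aggregation): MODE of those hours for each customer.
# most_common(1) returns the most frequent value; ties go to the value seen first.
mode_hour = {
    cid: Counter(hours).most_common(1)[0][0]
    for cid, hours in hours_by_customer.items()
}
print(mode_hour)
```

Customer 1 started two of three sessions in hour 0, so the feature value for that customer is 0.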

Stacking results in features that are more expressive than individual primitives themselves. This enables the automatic creation of complex patterns for machine learning.

## Changing target entity
DFS can build features for any entity in the entity set. Here we create a feature matrix for the `sessions` entity:

In [16]:
feature_matrix, feature_defs = ft.dfs(entityset = es, 
                                       target_entity = 'sessions', 
                                       agg_primitives = ['mean', 'sum', 'mode'], 
                                       trans_primitives = ['month', 'hour'],
                                       max_depth = 2)

In [17]:
feature_matrix

| session_id | customer_id | device | MEAN(transactions.amount) | SUM(transactions.amount) | MODE(transactions.product_id) | MONTH(session_start) | HOUR(session_start) | customers.zip_code | MODE(transactions.MONTH(transaction_time)) | MODE(transactions.HOUR(transaction_time)) | MODE(transactions.products.brand) | customers.MODE(sessions.device) | customers.MEAN(transactions.amount) | customers.SUM(transactions.amount) | customers.MODE(transactions.product_id) | customers.MONTH(join_date) | customers.HOUR(join_date) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | desktop | 77.84625 | 1245.54 | 2 | 1 | 0 | 60091 | 1 | 0 | C | desktop | 78.143282 | 10236.77 | 3 | 1 | 0 |
| 2 | 1 | desktop | 89.533 | 895.33 | 3 | 1 | 0 | 60091 | 1 | 0 | C | desktop | 78.143282 | 10236.77 | 3 | 1 | 0 |
| 3 | 5 | mobile | 67.13 | 939.82 | 5 | 1 | 0 | 2139 | 1 | 0 | C | tablet | 78.816724 | 4571.37 | 2 | 7 | 0 |
| 4 | 3 | mobile | 82.1728 | 2054.32 | 1 | 1 | 0 | 2139 | 1 | 0 | C | desktop | 73.82359 | 5758.24 | 5 | 4 | 0 |
| 5 | 2 | tablet | 65.031818 | 715.35 | 1 | 1 | 1 | 2139 | 1 | 1 | B | mobile | 74.744344 | 9118.81 | 1 | 2 | 0 |
| 6 | 1 | desktop | 70.699412 | 1201.89 | 1 | 1 | 1 | 60091 | 1 | 1 | B | desktop | 78.143282 | 10236.77 | 3 | 1 | 0 |
| 7 | 2 | desktop | 71.148571 | 996.08 | 4 | 1 | 1 | 2139 | 1 | 1 | A | mobile | 74.744344 | 9118.81 | 1 | 2 | 0 |
| 8 | 2 | mobile | 63.326111 | 1139.87 | 5 | 1 | 1 | 2139 | 1 | 2 | C | mobile | 74.744344 | 9118.81 | 1 | 2 | 0 |
| 9 | 1 | desktop | 83.244667 | 1248.67 | 1 | 1 | 2 | 60091 | 1 | 2 | B | desktop | 78.143282 | 10236.77 | 3 | 1 | 0 |
| 10 | 2 | mobile | 66.718667 | 1000.78 | 1 | 1 | 2 | 2139 | 1 | 2 | B | mobile | 74.744344 | 9118.81 | 1 | 2 | 0 |


In [18]:
feature_defs

[<Feature: customer_id>,
 <Feature: device>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: SUM(transactions.amount)>,
 <Feature: MODE(transactions.product_id)>,
 <Feature: MONTH(session_start)>,
 <Feature: HOUR(session_start)>,
 <Feature: customers.zip_code>,
 <Feature: MODE(transactions.MONTH(transaction_time))>,
 <Feature: MODE(transactions.HOUR(transaction_time))>,
 <Feature: MODE(transactions.products.brand)>,
 <Feature: customers.MODE(sessions.device)>,
 <Feature: customers.MEAN(transactions.amount)>,
 <Feature: customers.SUM(transactions.amount)>,
 <Feature: customers.MODE(transactions.product_id)>,
 <Feature: customers.MONTH(join_date)>,
 <Feature: customers.HOUR(join_date)>]

As we can see, DFS also builds deep features based on a parent entity, in this case the customer of a particular session. For example, the feature below is the mean transaction amount of the customer to whom the session belongs.

In [26]:
feature_matrix[['customers.MEAN(transactions.amount)']].head(5)

| session_id | customers.MEAN(transactions.amount) |
|---|---|
| 1 | 78.143282 |
| 2 | 78.143282 |
| 3 | 78.816724 |
| 4 | 73.82359 |
| 5 | 74.744344 |
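
A feature of this kind is computed on the parent and then looked up by each child row. The plain-Python sketch below illustrates that idea under simplified assumptions. It is not the featuretools implementation, and the toy data and the names `session_customer`, `customer_mean`, and `session_feature` are hypothetical.

```python
from statistics import mean

# Hypothetical toy data.
transactions = [  # (session_id, amount)
    (1, 10.0),
    (1, 30.0),
    (2, 50.0),
    (3, 8.0),
]
session_customer = {1: "a", 2: "a", 3: "b"}  # session_id -> customer_id

# Aggregate on the parent: MEAN(transactions.amount) per customer.
amounts_by_customer = {}
for session_id, amount in transactions:
    customer_id = session_customer[session_id]
    amounts_by_customer.setdefault(customer_id, []).append(amount)
customer_mean = {cid: mean(vals) for cid, vals in amounts_by_customer.items()}

# Direct feature: each session looks up the value of its parent customer.
session_feature = {
    sid: customer_mean[cid] for sid, cid in session_customer.items()
}
print(session_feature)
```

Sessions that share a customer receive the same value, which matches the output above, where sessions 1 and 2 both belong to customer 1 and both show 78.143282.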


## Improve feature output
See [Tuning DFS](https://docs.featuretools.com/guides/tuning_dfs.html) for ways to improve the generated features.