# Quick start guide  

https://docs.featuretools.com/#minute-quick-start

**This guide uses Deep Feature Synthesis (DFS) to perform automated feature engineering.**

In [1]:
import featuretools as ft

In [2]:
# mock data
data = ft.demo.load_mock_customer()

**Prepare data**

The documentation describes three tables in this dataset, but `load_mock_customer()` actually returns four: the dictionary also contains a `products` table, which this guide does not use.

In Featuretools, each table is called an `entity`.

- **customers**: unique customers who had sessions
- **sessions**: unique sessions and associated attributes
- **transactions**: list of events that occurred in each session


In [8]:
type(data)

dict

In [16]:
for ind, ent in enumerate(data.keys()):
    print('Entity {} : {}'.format(ind + 1, ent))

Entity 1 : customers
Entity 2 : sessions
Entity 3 : transactions
Entity 4 : products


In [17]:
customers_df = data['customers']
customers_df

,customer_id,zip_code,join_date
0,1,60091,2008-01-01
1,2,2139,2008-02-20
2,3,2139,2008-04-10
3,4,60091,2008-05-30
4,5,2139,2008-07-19


In [21]:
sessions_df = data['sessions']
sessions_df.sample(5)

,session_id,customer_id,device,session_start
29,30,4,desktop,2014-01-01 07:29:35
9,10,2,mobile,2014-01-01 02:31:40
12,13,1,desktop,2014-01-01 03:15:00
5,6,1,desktop,2014-01-01 01:22:20
1,2,1,desktop,2014-01-01 00:17:20


In [22]:
transactions_df = data['transactions']
transactions_df.sample(5)

,transaction_id,session_id,transaction_time,product_id,amount
450,268,32,2014-01-01 08:07:30,3,137.75
89,419,6,2014-01-01 01:36:25,3,36.68
471,121,34,2014-01-01 08:30:15,2,79.61
340,491,25,2014-01-01 06:08:20,4,77.13
142,20,10,2014-01-01 02:33:50,2,143.85


**First, specify a dictionary with all the entities in the dataset.**

In [23]:
entities = {
    "customers" : (customers_df, "customer_id"),
    "sessions" : (sessions_df, "session_id", "session_start"),
    "transactions" : (transactions_df, "transaction_id", "transaction_time")
}
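Each tuple in the dictionary above pairs a dataframe with the column that identifies its rows (and, for two of the tables, a time column). The sketch below illustrates why that identifying column matters: it must be unique per row. It uses plain Python and a few made-up rows rather than the real dataframes; the helper `index_is_unique` is hypothetical and not part of Featuretools.

```python
# Toy rows standing in for the real dataframes (values are illustrative).
customers = [{"customer_id": 1}, {"customer_id": 2}]
sessions = [
    {"session_id": 1, "customer_id": 1, "session_start": "2014-01-01 00:00:00"},
    {"session_id": 2, "customer_id": 1, "session_start": "2014-01-01 00:17:20"},
]


def index_is_unique(rows, index_column):
    """Return True when every row has a distinct value in index_column."""
    values = [row[index_column] for row in rows]
    return len(values) == len(set(values))


checks = {
    "customers": index_is_unique(customers, "customer_id"),
    "sessions": index_is_unique(sessions, "session_id"),
    # customer_id repeats across sessions, so it cannot index that table.
    "sessions_by_customer": index_is_unique(sessions, "customer_id"),
}
print(checks)
```

The same idea could be applied to the real tables before building the dictionary, for example by comparing the number of rows with the number of distinct index values.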

**Second, specify how the entities are related. When two entities have a one-to-many relationship, the “one” entity is called the “parent entity”. A relationship between a parent and child is defined like this:**

```
(parent_entity, parent_variable, child_entity, child_variable)
```

The dataset has two relationships:

In [30]:
relationships = [("sessions", "session_id", "transactions", "session_id"),
                 ("customers", "customer_id", "sessions", "customer_id")]
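A one-to-many relationship only makes sense if every key in the child table also exists in the parent table. As a rough sketch, the check below walks the same four-element tuples and reports child keys with no matching parent. The rows are invented for illustration, and `orphaned_keys` is a hypothetical helper, not a Featuretools function.

```python
# Toy parent and child rows (illustrative values, not the mock dataset).
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "sessions": [
        {"session_id": 1, "customer_id": 1},
        {"session_id": 2, "customer_id": 2},
    ],
    "transactions": [
        {"transaction_id": 10, "session_id": 1},
        {"transaction_id": 11, "session_id": 1},
        {"transaction_id": 12, "session_id": 3},  # refers to a missing session
    ],
}

# Same layout as above:
# (parent_entity, parent_variable, child_entity, child_variable)
relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]


def orphaned_keys(tables, relationship):
    """Return child key values that do not appear in the parent column."""
    parent, parent_col, child, child_col = relationship
    parent_keys = {row[parent_col] for row in tables[parent]}
    child_keys = {row[child_col] for row in tables[child]}
    return sorted(child_keys - parent_keys)


# Map each child entity to the keys that have no parent row.
orphans = {rel[2]: orphaned_keys(tables, rel) for rel in relationships}
print(orphans)
```

In this toy data the transaction pointing at session 3 is flagged, while every session has a matching customer.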

Featuretools also offers a more convenient API for defining *entities* and *relationships*:  
`EntitySet` : https://docs.featuretools.com/loading_data/using_entitysets.html

## **Run Deep Feature Synthesis**
The minimal input to DFS is:
- a set of entities
- a list of relationships
- the `target_entity` to compute features for

The output is:
- a feature matrix
- the corresponding list of feature definitions

In [31]:
feature_matrix_customers, features_defs = ft.dfs(entities=entities, 
                                                 relationships=relationships,
                                                 target_entity="customers")
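To give a feel for what the call above produces, here is a hand-rolled sketch of two aggregation features for the `customers` target: `COUNT(sessions)` and `SUM(transactions.amount)`. It follows the relationships from transactions to sessions to customers using plain dictionaries. The rows are made up, and this is only an illustration of the idea, not how Featuretools computes features internally.

```python
from collections import defaultdict

# Toy rows (illustrative values, not the mock dataset).
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 3, "customer_id": 2},
]
transactions = [
    {"transaction_id": 1, "session_id": 1, "amount": 10.0},
    {"transaction_id": 2, "session_id": 1, "amount": 5.5},
    {"transaction_id": 3, "session_id": 2, "amount": 20.0},
    {"transaction_id": 4, "session_id": 3, "amount": 7.25},
]

# Follow the sessions -> customers relationship.
session_to_customer = {s["session_id"]: s["customer_id"] for s in sessions}

# COUNT(sessions): number of child sessions per customer.
count_sessions = defaultdict(int)
for s in sessions:
    count_sessions[s["customer_id"]] += 1

# SUM(transactions.amount): total amount over all of a customer's transactions.
sum_amount = defaultdict(float)
for t in transactions:
    sum_amount[session_to_customer[t["session_id"]]] += t["amount"]

# One row per customer, one column per feature.
feature_rows = {
    cid: {
        "COUNT(sessions)": count_sessions[cid],
        "SUM(transactions.amount)": sum_amount[cid],
    }
    for cid in sorted(count_sessions)
}
print(feature_rows)
```

DFS automates this kind of bookkeeping across many aggregation functions and columns at once, which is why the resulting matrix is so much wider than the original table.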

In [41]:
feature_matrix_customers

customer_id,zip_code,COUNT(sessions),NUM_UNIQUE(sessions.device),MODE(sessions.device),SUM(transactions.amount),STD(transactions.amount),MAX(transactions.amount),SKEW(transactions.amount),MIN(transactions.amount),MEAN(transactions.amount),...,NUM_UNIQUE(sessions.MODE(transactions.product_id)),NUM_UNIQUE(sessions.DAY(session_start)),NUM_UNIQUE(sessions.YEAR(session_start)),NUM_UNIQUE(sessions.MONTH(session_start)),NUM_UNIQUE(sessions.WEEKDAY(session_start)),MODE(sessions.MODE(transactions.product_id)),MODE(sessions.DAY(session_start)),MODE(sessions.YEAR(session_start)),MODE(sessions.MONTH(session_start)),MODE(sessions.WEEKDAY(session_start))
1,60091,10,3,desktop,10236.77,42.673267,149.95,0.070041,5.6,78.143282,...,3,1,1,1,1,1,1,2014,1,2
2,2139,8,3,mobile,9118.81,43.204771,149.15,0.028647,5.81,74.744344,...,5,1,1,1,1,1,1,2014,1,2
3,2139,5,2,desktop,5758.24,40.127924,147.73,0.070814,6.78,73.82359,...,4,1,1,1,1,3,1,2014,1,2
4,60091,8,3,desktop,8205.28,41.857208,149.56,0.087986,5.73,73.921441,...,5,1,1,1,1,1,1,2014,1,2
5,2139,4,3,tablet,4571.37,42.656189,148.17,0.085883,5.91,78.816724,...,3,1,1,1,1,2,1,2014,1,2


In [42]:
print('Original number of features : {}'.format(customers_df.shape[1]))

Original number of features : 3


60+ new features were created to describe customers!
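Many of the new column names are nested, such as `NUM_UNIQUE(sessions.DAY(session_start))` in the feature matrix above. A reasonable reading is that the inner function is applied to each session first and the outer function then aggregates per customer. The sketch below reproduces that two-step idea by hand on invented rows; the helper `day` is hypothetical and only mimics what the name suggests.

```python
from collections import defaultdict
from datetime import datetime

# Toy sessions (illustrative values, not the mock dataset).
sessions = [
    {"session_id": 1, "customer_id": 1, "session_start": "2014-01-01 00:00:00"},
    {"session_id": 2, "customer_id": 1, "session_start": "2014-01-02 09:30:00"},
    {"session_id": 3, "customer_id": 2, "session_start": "2014-01-01 03:15:00"},
]


def day(timestamp):
    """Inner step: extract the day of the month from a timestamp string."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").day


# Outer step: collect the distinct days seen for each customer.
days_per_customer = defaultdict(set)
for s in sessions:
    days_per_customer[s["customer_id"]].add(day(s["session_start"]))

# NUM_UNIQUE(sessions.DAY(session_start)) for each customer.
num_unique_days = {
    cid: len(days) for cid, days in sorted(days_per_customer.items())
}
print(num_unique_days)
```

In the mock data every session falls on the same day, which is consistent with that column being 1 for all five customers.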

**Try with another target entity**

In [34]:
feature_matrix_sessions, features_defs = ft.dfs(entities=entities,
                                                relationships=relationships,
                                                target_entity="sessions")

In [36]:
feature_matrix_sessions.head(5)

session_id,customer_id,device,SUM(transactions.amount),STD(transactions.amount),MAX(transactions.amount),SKEW(transactions.amount),MIN(transactions.amount),MEAN(transactions.amount),COUNT(transactions),NUM_UNIQUE(transactions.product_id),...,customers.SKEW(transactions.amount),customers.MIN(transactions.amount),customers.MEAN(transactions.amount),customers.COUNT(transactions),customers.NUM_UNIQUE(transactions.product_id),customers.MODE(transactions.product_id),customers.DAY(join_date),customers.YEAR(join_date),customers.MONTH(join_date),customers.WEEKDAY(join_date)
1,1,desktop,1245.54,41.583078,147.23,-0.067531,5.6,77.84625,16,5,...,0.070041,5.6,78.143282,131,5,3,1,2008,1,1
2,1,desktop,895.33,43.095021,148.14,-0.395358,8.67,89.533,10,4,...,0.070041,5.6,78.143282,131,5,3,1,2008,1,1
3,5,mobile,939.82,39.467434,141.66,0.830112,20.91,67.13,14,5,...,0.085883,5.91,78.816724,58,5,2,19,2008,7,5
4,3,mobile,2054.32,45.708958,147.73,-0.215072,8.7,82.1728,25,5,...,0.070814,6.78,73.82359,78,5,5,10,2008,4,3
5,2,tablet,715.35,39.413598,124.29,0.102851,6.29,65.031818,11,5,...,0.028647,5.81,74.744344,122,5,1,20,2008,2,2


In [44]:
print('Original number of features : {}'.format(sessions_df.shape[1]))

Original number of features : 4
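Columns prefixed with `customers.` in the session matrix appear to be values computed for the parent customer and copied onto each of that customer's sessions. The sketch below mimics this lookup for `customers.COUNT(transactions)`, reusing the counts and session-to-customer pairs visible in the output above; the dictionary layout is an assumption made for illustration.

```python
# Per-customer values taken from the session matrix output above.
customer_features = {
    1: {"COUNT(transactions)": 131},
    2: {"COUNT(transactions)": 122},
}

# Session-to-customer pairs, also as shown in the output above.
sessions = [
    {"session_id": 1, "customer_id": 1},
    {"session_id": 2, "customer_id": 1},
    {"session_id": 5, "customer_id": 2},
]

# Each session inherits the feature value of its parent customer.
session_features = {
    s["session_id"]: {
        "customer_id": s["customer_id"],
        "customers.COUNT(transactions)": customer_features[s["customer_id"]][
            "COUNT(transactions)"
        ],
    }
    for s in sessions
}
print(session_features)
```

Sessions 1 and 2 share the same value because they belong to the same customer, matching the repeated 131 in the first two rows of the matrix.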


### Next?

- Learn about representing data with [EntitySets](https://docs.featuretools.com/loading_data/using_entitysets.html)
- Apply automated feature engineering with [Deep Feature Synthesis](https://docs.featuretools.com/automated_feature_engineering/afe.html)