# Representing Data with EntitySets

In [1]:
import featuretools as ft

An `EntitySet` is a collection of entities and the relationships between them.
EntitySets are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take `entities` and `relationships` as separate arguments, creating an `EntitySet` is recommended because it makes the data easier to manipulate as needed.

**`EntitySet`**
- Entities
- Relationships
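
As a mental model only, the two parts listed above can be sketched in plain Python. `ToyEntitySet` and `ToyRelationship` are hypothetical names invented for this illustration, not Featuretools classes, and the rows are made-up toy data.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRelationship:
    # One-to-many link: one parent row, many child rows.
    parent_entity: str
    parent_variable: str
    child_entity: str
    child_variable: str


@dataclass
class ToyEntitySet:
    id: str
    # entity name -> list of row dicts (stand-in for a dataframe)
    entities: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)


toy = ToyEntitySet(id="transactions")
toy.entities["products"] = [{"product_id": 1, "brand": "B"}]
toy.entities["transactions"] = [
    {"transaction_id": 10, "product_id": 1, "amount": 5.0},
]
toy.relationships.append(
    ToyRelationship("products", "product_id", "transactions", "product_id")
)
```

The real `EntitySet` also tracks variable types and time indexes, which this sketch leaves out.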

## Raw Data

In [62]:
data = ft.demo.load_mock_customer()
customers_df = data['customers']
sessions_df = data['sessions']
transactions_df = data['transactions']
products_df = data['products']

In [63]:
# create the original dataframe for the examples 
transactions_df = transactions_df\
    .merge(sessions_df, on = 'session_id')\
    .merge(customers_df, on = 'customer_id')\
    .drop(columns = ['session_start', 'join_date'])

In [64]:
# show some entries of the data to be used 
# transactions & products
transactions_df.sample(10)

Unnamed: 0,transaction_id,session_id,transaction_time,product_id,amount,customer_id,device,zip_code
194,495,4,2014-01-01 00:48:45,4,90.69,3,mobile,2139
326,157,19,2014-01-01 04:30:50,1,110.87,2,tablet,2139
443,465,21,2014-01-01 05:07:40,1,54.66,4,desktop,60091
211,91,4,2014-01-01 01:07:10,2,143.93,3,mobile,2139
280,225,7,2014-01-01 01:42:55,3,71.53,2,desktop,2139
460,7,22,2014-01-01 05:26:05,3,83.33,4,tablet,60091
19,85,2,2014-01-01 00:20:35,4,148.14,1,desktop,60091
265,438,34,2014-01-01 08:43:15,4,100.04,3,desktop,2139
146,462,11,2014-01-01 02:49:00,1,27.46,5,tablet,2139
104,379,27,2014-01-01 06:40:50,4,131.83,1,desktop,60091


In [15]:
products_df

Unnamed: 0,product_id,brand
0,1,B
1,2,B
2,3,C
3,4,A
4,5,C


## Creating EntitySet

First, initialize the `EntitySet` and give it an `id`.

In [98]:
es = ft.EntitySet(id = 'transactions')

## Adding Entities

`.entity_from_dataframe()`
- `index`: specifies the column that uniquely identifies each row in the dataframe.
- `time_index`: specifies the column that tells Featuretools when each row's data was created.
- `variable_types`: overrides the inferred variable types. Here it indicates that “product_id” should be interpreted as a Categorical variable, even though it is just an integer in the underlying data.
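
A minimal plain-Python sketch of what the `index` and `variable_types` choices mean in practice. It does not call Featuretools, the rows are made-up toy data, and the variable names are only illustrative. The first check confirms that the index column has no repeated values. The second treats `product_id` as a label to be counted rather than a number to be averaged.

```python
from collections import Counter

# Toy rows standing in for a few lines of the transactions dataframe.
rows = [
    {"transaction_id": 1, "product_id": 4, "amount": 7.0},
    {"transaction_id": 2, "product_id": 4, "amount": 12.5},
    {"transaction_id": 3, "product_id": 2, "amount": 3.0},
]

# An index must uniquely identify rows: no repeated values.
index_values = [row["transaction_id"] for row in rows]
index_is_unique = len(index_values) == len(set(index_values))

# A categorical variable is counted, not averaged: the mean of
# product IDs is meaningless, while the most frequent ID is not.
product_counts = Counter(row["product_id"] for row in rows)
most_common_product = product_counts.most_common(1)[0][0]
```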


In [99]:
es = es.entity_from_dataframe(entity_id="transactions",
                              dataframe=transactions_df,
                              index="transaction_id",
                              time_index="transaction_time",
                              variable_types={"product_id": ft.variable_types.Categorical})
es

Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 8]
  Relationships:
    No relationships

In [100]:
es["transactions"].variables

[<Variable: transaction_id (dtype = index)>,
 <Variable: session_id (dtype = numeric)>,
 <Variable: transaction_time (dtype: datetime_time_index, format: None)>,
 <Variable: amount (dtype = numeric)>,
 <Variable: customer_id (dtype = numeric)>,
 <Variable: device (dtype = categorical)>,
 <Variable: zip_code (dtype = categorical)>,
 <Variable: product_id (dtype = categorical)>]

In [101]:
es = es.entity_from_dataframe(entity_id="products",
                              dataframe=products_df,
                              index="product_id")

es

Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 8]
    products [Rows: 5, Columns: 2]
  Relationships:
    No relationships

## Adding a relationship

We want to relate these two entities by the columns called “product_id” in each entity. 

Relationship:
- *products*: **parent entity** (each product has multiple transactions associated with it)
- *transactions*: **child entity**

When specifying relationships we list the variable in the parent entity first. Note that each `ft.Relationship` **must denote a one-to-many** relationship rather than a relationship which is one-to-one or many-to-many.
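
The one-to-many requirement can be checked before adding a relationship. The helper below is a hypothetical plain-Python sketch, not part of Featuretools. It checks that the parent key is unique and, as an extra assumption of this sketch, that every child value points at an existing parent row.

```python
def is_one_to_many(parent_rows, child_rows, parent_key, child_key):
    """Return True if each child row maps to exactly one parent row."""
    parent_values = [row[parent_key] for row in parent_rows]
    # A repeated parent key would make the relationship many-to-many.
    if len(parent_values) != len(set(parent_values)):
        return False
    known = set(parent_values)
    # Assumption of this sketch: no child may reference a missing parent.
    return all(row[child_key] in known for row in child_rows)


# Made-up toy rows.
toy_products = [{"product_id": 1}, {"product_id": 2}]
toy_transactions = [
    {"transaction_id": 10, "product_id": 1},
    {"transaction_id": 11, "product_id": 1},
    {"transaction_id": 12, "product_id": 2},
]

valid = is_one_to_many(
    toy_products, toy_transactions, "product_id", "product_id"
)
duplicated_parent = is_one_to_many(
    toy_products + [{"product_id": 2}],
    toy_transactions,
    "product_id",
    "product_id",
)
```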

In [102]:
new_relationship = ft.Relationship(es['products']['product_id'],
                                  es['transactions']['product_id'])
es.add_relationship(new_relationship)


Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 8]
    products [Rows: 5, Columns: 2]
  Relationships:
    transactions.product_id -> products.product_id

## Creating entity from existing table
In order to create a new entity and relationship for sessions, we “normalize” the “transactions” entity.

The call below performs two operations:
- It creates a new entity called “sessions” based on the “session_id” variable in “transactions”.
- It adds a relationship connecting “transactions” and “sessions”.

In [103]:
es.normalize_entity(base_entity_id="transactions",
                    new_entity_id="sessions",
                    index="session_id",
                    additional_variables=["device", "customer_id", "zip_code"])

Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id

--- 

The same call performs two more operations:
- It removes “device”, “customer_id”, and “zip_code” from “transactions” and creates new variables for them in the “sessions” entity. This reduces redundant information, since those properties of a session don’t change between transactions.
- It creates the “first_transactions_time” variable in the new “sessions” entity to indicate the beginning of a session. If we don’t want this variable to be created, we can set `make_time_index=False`.
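
To make the normalization steps concrete, here is a rough plain-Python sketch of the same idea. `normalize_rows` and the `first_time` key are hypothetical names for this illustration, the rows are made-up toy data, and the real `normalize_entity` may differ in details.

```python
from datetime import datetime


def normalize_rows(rows, index, additional_variables, time_index):
    """Split rows into child rows and one parent row per `index` value."""
    parents = {}
    for row in rows:
        key = row[index]
        if key not in parents:
            # First time we see this key: create the parent row.
            parent = {index: key}
            for name in additional_variables:
                parent[name] = row[name]
            parent["first_time"] = row[time_index]
            parents[key] = parent
        else:
            # Keep the earliest child time as the parent's time index.
            parents[key]["first_time"] = min(
                parents[key]["first_time"], row[time_index]
            )
    moved = set(additional_variables)
    # Child rows keep the key but lose the moved columns.
    children = [
        {k: v for k, v in row.items() if k not in moved} for row in rows
    ]
    return children, list(parents.values())


toy_rows = [
    {"transaction_id": 1, "session_id": 1, "device": "desktop",
     "transaction_time": datetime(2014, 1, 1, 0, 1, 5)},
    {"transaction_id": 2, "session_id": 1, "device": "desktop",
     "transaction_time": datetime(2014, 1, 1, 0, 0, 0)},
    {"transaction_id": 3, "session_id": 2, "device": "mobile",
     "transaction_time": datetime(2014, 1, 1, 0, 17, 20)},
]

toy_children, toy_sessions = normalize_rows(
    toy_rows, "session_id", ["device"], "transaction_time"
)
```

Each session appears once in `toy_sessions`, while `toy_children` keeps only `session_id` as the link back to it.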

In [104]:
es['transactions'].variables

[<Variable: transaction_id (dtype = index)>,
 <Variable: session_id (dtype = id)>,
 <Variable: transaction_time (dtype: datetime_time_index, format: None)>,
 <Variable: amount (dtype = numeric)>,
 <Variable: product_id (dtype = categorical)>]

In [105]:
es['sessions'].variables

[<Variable: session_id (dtype = index)>,
 <Variable: device (dtype = categorical)>,
 <Variable: customer_id (dtype = numeric)>,
 <Variable: zip_code (dtype = categorical)>,
 <Variable: first_transactions_time (dtype: datetime_time_index, format: None)>]

### Check the dataframes

In [106]:
es['transactions'].df.head(5)

Unnamed: 0,transaction_id,session_id,transaction_time,amount,product_id
352,352,1,2014-01-01 00:00:00,7.39,4
186,186,1,2014-01-01 00:01:05,147.23,4
319,319,1,2014-01-01 00:02:10,111.34,2
256,256,1,2014-01-01 00:03:15,78.15,4
449,449,1,2014-01-01 00:04:20,33.93,3


In [107]:
es['sessions'].df.head(5)

Unnamed: 0,session_id,device,customer_id,zip_code,first_transactions_time
1,1,desktop,1,60091,2014-01-01 00:00:00
2,2,desktop,1,60091,2014-01-01 00:17:20
3,3,mobile,5,2139,2014-01-01 00:28:10
4,4,mobile,3,2139,2014-01-01 00:43:20
5,5,tablet,2,2139,2014-01-01 01:10:25


**Create the customers entity**

In [108]:
es.normalize_entity(base_entity_id = 'sessions',
                    new_entity_id = 'customers',
                    index = 'customer_id',
                    additional_variables = ["zip_code"],
                    make_time_index = False)

Entityset: transactions
  Entities:
    transactions [Rows: 500, Columns: 5]
    products [Rows: 5, Columns: 2]
    sessions [Rows: 35, Columns: 4]
    customers [Rows: 5, Columns: 2]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

## Using the EntitySet

In [109]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                     target_entity = 'products')

In [110]:
feature_matrix

product_id,brand,SUM(transactions.amount),STD(transactions.amount),MAX(transactions.amount),SKEW(transactions.amount),MIN(transactions.amount),MEAN(transactions.amount),COUNT(transactions),NUM_UNIQUE(transactions.session_id),MODE(transactions.session_id),...,NUM_UNIQUE(transactions.MONTH(transaction_time)),NUM_UNIQUE(transactions.WEEKDAY(transaction_time)),NUM_UNIQUE(transactions.sessions.device),NUM_UNIQUE(transactions.sessions.customer_id),MODE(transactions.DAY(transaction_time)),MODE(transactions.YEAR(transaction_time)),MODE(transactions.MONTH(transaction_time)),MODE(transactions.WEEKDAY(transaction_time)),MODE(transactions.sessions.device),MODE(transactions.sessions.customer_id)
1,B,7046.84,40.23277,148.86,-0.027598,6.29,71.906531,98,31,4,...,1,1,3,5,1,2014,1,2,desktop,2
2,B,7247.48,39.083334,147.86,0.180324,8.19,75.494583,96,34,19,...,1,1,3,5,1,2014,1,2,desktop,1
3,C,7916.96,41.64718,149.95,-0.075324,5.81,82.468333,96,35,31,...,1,1,3,5,1,2014,1,2,desktop,1
4,A,8181.19,44.354276,149.02,0.153199,5.73,75.056789,109,34,30,...,1,1,3,5,1,2014,1,2,desktop,4
5,C,7498.0,44.686334,149.56,0.08786,5.6,74.237624,101,34,28,...,1,1,3,5,1,2014,1,2,desktop,1
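
Columns such as `SUM(transactions.amount)` and `COUNT(transactions)` aggregate the child rows of each product. The plain-Python sketch below shows the idea on made-up toy rows. `aggregate_amounts` is a hypothetical helper, not how Featuretools computes these values internally.

```python
from collections import defaultdict
from statistics import mean


def aggregate_amounts(transactions):
    """Group child rows by product_id and aggregate their amounts."""
    grouped = defaultdict(list)
    for row in transactions:
        grouped[row["product_id"]].append(row["amount"])
    return {
        product_id: {
            "SUM(transactions.amount)": sum(amounts),
            "MEAN(transactions.amount)": mean(amounts),
            "MAX(transactions.amount)": max(amounts),
            "MIN(transactions.amount)": min(amounts),
            "COUNT(transactions)": len(amounts),
        }
        for product_id, amounts in grouped.items()
    }


toy_rows = [
    {"product_id": 1, "amount": 10.0},
    {"product_id": 1, "amount": 30.0},
    {"product_id": 2, "amount": 5.0},
]

toy_features = aggregate_amounts(toy_rows)
```

The result has one entry per product, mirroring the one-row-per-`product_id` layout of the feature matrix.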


In [111]:
feature_defs

[<Feature: brand>,
 <Feature: SUM(transactions.amount)>,
 <Feature: STD(transactions.amount)>,
 <Feature: MAX(transactions.amount)>,
 <Feature: SKEW(transactions.amount)>,
 <Feature: MIN(transactions.amount)>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: COUNT(transactions)>,
 <Feature: NUM_UNIQUE(transactions.session_id)>,
 <Feature: MODE(transactions.session_id)>,
 <Feature: NUM_UNIQUE(transactions.DAY(transaction_time))>,
 <Feature: NUM_UNIQUE(transactions.YEAR(transaction_time))>,
 <Feature: NUM_UNIQUE(transactions.MONTH(transaction_time))>,
 <Feature: NUM_UNIQUE(transactions.WEEKDAY(transaction_time))>,
 <Feature: NUM_UNIQUE(transactions.sessions.device)>,
 <Feature: NUM_UNIQUE(transactions.sessions.customer_id)>,
 <Feature: MODE(transactions.DAY(transaction_time))>,
 <Feature: MODE(transactions.YEAR(transaction_time))>,
 <Feature: MODE(transactions.MONTH(transaction_time))>,
 <Feature: MODE(transactions.WEEKDAY(transaction_time))>,
 <Feature: MODE(transactions.sessions.devic