This walkthrough uses the BigMart Sales dataset, since automated feature engineering with Featuretools needs a relational dataset that contains unique identifiers.

In [4]:
pip install featuretools

Collecting featuretools
  Downloading https://files.pythonhosted.org/packages/a6/55/5e206fff0ecfec66e9d95248a9302b0d691f1312afa71e96323a83963e19/featuretools-0.20.0-py3-none-any.whl (287kB)
Collecting distributed>=2.12.0
  Downloading https://files.pythonhosted.org/packages/70/70/cc541748094fb20ea45437da3c65cf28f8ff8d4afa54cd8ab776067859b1/distributed-2.29.0-py3-none-any.whl (653kB)
Collecting contextvars; python_version < "3.7"
  Downloading https://files.pythonhosted.org/packages/83/96/55b82d9f13763be9d672622e1b8106c85acb83edd7cc2fa5bc67cd9877e9/contextvars-2.4.tar.gz
Collecting fsspec>=0.6.0; extra == "dataframe"
  Downloading https://files.pythonhosted.org/packages/4c/38/39b83c70ff47192255c15da1b602322cb9918682199d5c1d9cf128bdd531/fsspec-0.8.3-py3-none-any.whl (88kB)
Collecting partd>=0.3

In [5]:
import featuretools as ft
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

In [6]:
test = pd.read_csv('/content/Test.csv')
train = pd.read_csv('/content/Train.csv')

In [7]:
train.head()

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [15]:
target = train['Item_Outlet_Sales']
train.drop(['Item_Outlet_Sales'], axis=1, inplace=True)

In [13]:
train.shape  # cell 13 ran before the target was dropped in cell 15, hence 12 columns

(8523, 12)

In [14]:
test.shape

(5681, 11)

In [17]:
primarykey_itemid = test['Item_Identifier']
primarykey_outletid = test['Outlet_Identifier']

In [18]:
train.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64

In [19]:
test.isnull().sum()

Item_Identifier                 0
Item_Weight                   976
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1606
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64

In [20]:
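# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which may act on a copy in recent pandas versions.
# Numeric column: fill missing weights with the column mean.
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())
test['Item_Weight'] = test['Item_Weight'].fillna(test['Item_Weight'].mean())
# Categorical column: mark missing outlet sizes with an explicit label.
train['Outlet_Size'] = train['Outlet_Size'].fillna("missing")
test['Outlet_Size'] = test['Outlet_Size'].fillna("missing")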
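
The two imputation strategies used above can be tried on a small frame first. This is a minimal sketch; the `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the two BigMart columns; values are invented.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", None],
})

# Numeric column: replace NaN with the mean of the observed values.
weight_mean = toy["Item_Weight"].mean()  # NaN values are skipped by default
toy["Item_Weight"] = toy["Item_Weight"].fillna(weight_mean)

# Categorical column: keep missingness visible as its own category.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna("missing")

print(toy)
```

Using an explicit "missing" label, rather than the most frequent value, keeps the fact that the size was unknown available as a signal to a downstream model.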

Featuretools needs a unique identifier for every row before it can engineer features, so we build a single id by concatenating the two primary keys (item identifier and outlet identifier).

In [21]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
combined_test_train = pd.concat([train, test], ignore_index=True)

In [22]:
combined_test_train['id'] = combined_test_train['Item_Identifier'] + combined_test_train['Outlet_Identifier']

In [23]:
combined_test_train.drop(['Item_Identifier'], axis=1, inplace=True)
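
A concatenated id only works as an index if every (item, outlet) pair occurs once, so it is worth checking uniqueness before handing the frame to Featuretools. A minimal sketch on an invented three-row frame (the rows in `toy` are illustrative, not a claim about the full dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01"],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT018"],
})

# Neither key is unique on its own ...
item_unique = toy["Item_Identifier"].is_unique
outlet_unique = toy["Outlet_Identifier"].is_unique

# ... but concatenating them gives one id per (item, outlet) pair.
toy["id"] = toy["Item_Identifier"] + toy["Outlet_Identifier"]
id_unique = toy["id"].is_unique

print(item_unique, outlet_unique, id_unique)  # prints: False False True
```

The same `is_unique` check can be run on `combined_test_train['id']`.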

Next we create an EntitySet: a structure that holds multiple dataframes and the relationships between them.

In [24]:
es = ft.EntitySet(id = 'sales')
es.entity_from_dataframe(entity_id = 'bigmart', dataframe = combined_test_train, index = 'id')

Entityset: sales
  Entities:
    bigmart [Rows: 14204, Columns: 11]
  Relationships:
    No relationships

Featuretools can split one table into several (normalization). Here we move the outlet-level columns into their own table so that a relationship can be established between items and outlets.

In [25]:
es.normalize_entity(base_entity_id='bigmart', new_entity_id='outlet', index='Outlet_Identifier',
                    additional_variables=['Outlet_Establishment_Year', 'Outlet_Size',
                                          'Outlet_Location_Type', 'Outlet_Type'])

Entityset: sales
  Entities:
    bigmart [Rows: 14204, Columns: 7]
    outlet [Rows: 10, Columns: 5]
  Relationships:
    bigmart.Outlet_Identifier -> outlet.Outlet_Identifier
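
To see what this normalization step amounts to, here is a rough pandas equivalent on three rows taken from the data shown earlier. It is a sketch of the idea, not Featuretools' internal implementation, and it moves only one outlet column (`Outlet_Type`) to stay short.

```python
import pandas as pd

bigmart = pd.DataFrame({
    "id": ["FDA15OUT049", "FDN15OUT049", "DRC01OUT018"],
    "Item_MRP": [249.8092, 141.618, 48.2692],
    "Outlet_Identifier": ["OUT049", "OUT049", "OUT018"],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type1", "Supermarket Type2"],
})

outlet_cols = ["Outlet_Type"]

# New table: one row per outlet, carrying the outlet-level columns.
outlet = (
    bigmart[["Outlet_Identifier"] + outlet_cols]
    .drop_duplicates(subset="Outlet_Identifier")
    .reset_index(drop=True)
)

# Base table: keeps only the foreign key pointing at the outlet table.
bigmart = bigmart.drop(columns=outlet_cols)

print(outlet)
print(bigmart.columns.tolist())
```

This matches the shape change in the output above: the four outlet columns leave `bigmart` (11 columns down to 7) and `outlet` holds them plus its index (5 columns, one row per outlet).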

The EntitySet now holds two entities, related through the outlet identifier. Next we use deep feature synthesis (DFS) to create new features in the dataset.

In [27]:
# ft.dfs returns the computed feature matrix and the list of feature definitions.
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='bigmart',
                                       max_depth=2,
                                       verbose=1,
                                       n_jobs=3)

Built 33 features
EntitySet scattered to 2 workers in 2 seconds
Elapsed: 00:01 | Progress: 100%|██████████


In [28]:
feature_matrix.head()

id,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,outlet.Outlet_Establishment_Year,outlet.Outlet_Size,outlet.Outlet_Location_Type,outlet.Outlet_Type,outlet.COUNT(bigmart),outlet.MAX(bigmart.Item_MRP),outlet.MAX(bigmart.Item_Visibility),outlet.MAX(bigmart.Item_Weight),outlet.MEAN(bigmart.Item_MRP),outlet.MEAN(bigmart.Item_Visibility),outlet.MEAN(bigmart.Item_Weight),outlet.MIN(bigmart.Item_MRP),outlet.MIN(bigmart.Item_Visibility),outlet.MIN(bigmart.Item_Weight),outlet.MODE(bigmart.Item_Fat_Content),outlet.MODE(bigmart.Item_Type),outlet.NUM_UNIQUE(bigmart.Item_Fat_Content),outlet.NUM_UNIQUE(bigmart.Item_Type),outlet.SKEW(bigmart.Item_MRP),outlet.SKEW(bigmart.Item_Visibility),outlet.SKEW(bigmart.Item_Weight),outlet.STD(bigmart.Item_MRP),outlet.STD(bigmart.Item_Visibility),outlet.STD(bigmart.Item_Weight),outlet.SUM(bigmart.Item_MRP),outlet.SUM(bigmart.Item_Visibility),outlet.SUM(bigmart.Item_Weight)
FDA15OUT049,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,1550,266.4884,0.18785,21.35,141.163199,0.059,12.803003,32.4558,0.0,4.555,Low Fat,Fruits and Vegetables,5,16,0.126294,0.790782,0.099024,62.144594,0.043924,4.650796,218802.9588,91.450099,19844.655
DRC01OUT018,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,1546,266.3226,0.188323,21.35,141.000899,0.059976,12.803638,31.89,0.0,4.555,Low Fat,Fruits and Vegetables,5,16,0.133528,0.783017,0.102602,62.022851,0.044489,4.650874,217987.3906,92.723425,19794.425
FDN15OUT049,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,1550,266.4884,0.18785,21.35,141.163199,0.059,12.803003,32.4558,0.0,4.555,Low Fat,Fruits and Vegetables,5,16,0.126294,0.790782,0.099024,62.144594,0.043924,4.650796,218802.9588,91.450099,19844.655
FDX07OUT010,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,missing,Tier 3,Grocery Store,925,266.6884,0.313935,21.35,141.159742,0.101939,12.72287,32.6558,0.0,4.61,Low Fat,Fruits and Vegetables,5,16,0.104693,0.776902,0.112759,62.010835,0.073604,4.67507,130572.7618,94.293418,11768.655
NCD19OUT013,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,1553,266.6884,0.185913,21.35,141.128428,0.060242,12.788139,31.49,0.0,4.555,Low Fat,Fruits and Vegetables,5,16,0.130888,0.759033,0.104392,62.140848,0.044005,4.650214,219172.4492,93.555174,19859.98


In [29]:
feature_matrix.shape

(14204, 33)

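Deep feature synthesis produces 33 features here: the 6 item-level columns, 4 outlet attributes pulled in through the relationship, and 23 aggregations (COUNT, MAX, MEAN, MIN, MODE, NUM_UNIQUE, SKEW, STD, SUM) computed per outlet over its items.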
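
The `outlet.*(bigmart...)` columns are per-outlet aggregates broadcast back onto each item row, which is why rows from the same outlet share identical values in those columns. A rough pandas sketch of that idea on three rows from the data (this mimics the result for three of the aggregations; it is not how Featuretools computes them internally):

```python
import pandas as pd

bigmart = pd.DataFrame({
    "id": ["FDA15OUT049", "FDN15OUT049", "DRC01OUT018"],
    "Item_MRP": [249.8092, 141.618, 48.2692],
    "Outlet_Identifier": ["OUT049", "OUT049", "OUT018"],
})

# Aggregate the item rows per outlet, naming columns the way DFS labels them.
per_outlet = bigmart.groupby("Outlet_Identifier").agg(**{
    "outlet.COUNT(bigmart)": ("id", "count"),
    "outlet.MEAN(bigmart.Item_MRP)": ("Item_MRP", "mean"),
    "outlet.MAX(bigmart.Item_MRP)": ("Item_MRP", "max"),
})

# Broadcast the outlet-level values back onto every item row.
features = bigmart.merge(
    per_outlet, left_on="Outlet_Identifier", right_index=True
).set_index("id")

print(features)
```

Because these aggregates depend only on the outlet, they add at most one distinct value per outlet (10 in this dataset) to each new column.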

In [30]:
feature_names

[<Feature: Item_Weight>,
 <Feature: Item_Fat_Content>,
 <Feature: Item_Visibility>,
 <Feature: Item_Type>,
 <Feature: Item_MRP>,
 <Feature: Outlet_Identifier>,
 <Feature: outlet.Outlet_Establishment_Year>,
 <Feature: outlet.Outlet_Size>,
 <Feature: outlet.Outlet_Location_Type>,
 <Feature: outlet.Outlet_Type>,
 <Feature: outlet.COUNT(bigmart)>,
 <Feature: outlet.MAX(bigmart.Item_MRP)>,
 <Feature: outlet.MAX(bigmart.Item_Visibility)>,
 <Feature: outlet.MAX(bigmart.Item_Weight)>,
 <Feature: outlet.MEAN(bigmart.Item_MRP)>,
 <Feature: outlet.MEAN(bigmart.Item_Visibility)>,
 <Feature: outlet.MEAN(bigmart.Item_Weight)>,
 <Feature: outlet.MIN(bigmart.Item_MRP)>,
 <Feature: outlet.MIN(bigmart.Item_Visibility)>,
 <Feature: outlet.MIN(bigmart.Item_Weight)>,
 <Feature: outlet.MODE(bigmart.Item_Fat_Content)>,
 <Feature: outlet.MODE(bigmart.Item_Type)>,
 <Feature: outlet.NUM_UNIQUE(bigmart.Item_Fat_Content)>,
 <Feature: outlet.NUM_UNIQUE(bigmart.Item_Type)>,
 <Feature: outlet.SKEW(bigmart.Item_MRP)>