# Advanced techniques

Let's explore feature engineering techniques with the house prices dataset from Kaggle.

An illustrative example of Deep Feature Synthesis is available in [this Kaggle kernel](https://www.kaggle.com/willkoehrsen/featuretools-for-good), and [this Stack Overflow answer](https://stackoverflow.com/questions/52418152/featuretools-can-it-be-applied-on-a-single-table-to-generate-features-even-when) explains how featuretools can be applied to a single table.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Advanced-techniques" data-toc-modified-id="Advanced-techniques-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Advanced techniques</a></span><ul class="toc-item"><li><span><a href="#Setup-the-dataset" data-toc-modified-id="Setup-the-dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Setup the dataset</a></span></li><li><span><a href="#Build-the-EntitySet" data-toc-modified-id="Build-the-EntitySet-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Build the EntitySet</a></span><ul class="toc-item"><li><span><a href="#Normalize-the-entity" data-toc-modified-id="Normalize-the-entity-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Normalize the entity</a></span></li><li><span><a href="#Deep-feature-synthesis" data-toc-modified-id="Deep-feature-synthesis-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Deep feature synthesis</a></span></li></ul></li></ul></li></ul></div>

## Setup the dataset

In [1]:
import nbimporter
from src.dataset import Dataset
import warnings

# Silence library warnings so they do not clutter the cell outputs.
warnings.filterwarnings('ignore')

houses = Dataset('./data/houseprices_prepared.csv.gz')
houses.describe()

Available types: [dtype('int64') dtype('O') dtype('float64')]
80 Features
43 categorical features
37 numerical features
16 categorical features with NAs
0 numerical features with NAs
64 Complete features
--
Target: Not set


We will replace the NAs in the dataset with 'None' or 'Unknown', since most of them are not really missing values. The file uses NA wherever a feature does not apply to a house, instead of encoding that condition with a dedicated value (such as the string 'None').

In [2]:
houses.replace_na(column='Electrical', value='Unknown')
houses.replace_na(column=houses.names('categorical_na'), value='None')
houses.set_target('SalePrice')
houses.describe()

Available types: [dtype('int64') dtype('O') dtype('float64')]
80 Features
43 categorical features
37 numerical features
0 categorical features with NAs
0 numerical features with NAs
80 Complete features
--
Target: SalePrice
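The `Dataset` class is a helper local to this project, so the replacement above cannot be run without it. As a rough equivalent, the same idea can be sketched in plain pandas. The toy values and the `cat_na` variable below are made up for illustration, and this is not necessarily how `replace_na` is implemented.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the house prices data (values are made up).
df = pd.DataFrame({
    'Id': [1, 2, 3],
    'Electrical': ['SBrkr', np.nan, 'FuseA'],
    'PoolQC': [np.nan, 'Gd', np.nan],
    'LotArea': [8450, 9600, 11250],
})

# The notebook maps NAs in 'Electrical' to 'Unknown'.
df['Electrical'] = df['Electrical'].fillna('Unknown')

# Every other non-numeric column that still has NAs gets the string 'None'.
cat_na = [
    c for c in df.columns
    if df[c].isna().any() and not pd.api.types.is_numeric_dtype(df[c])
]
df[cat_na] = df[cat_na].fillna('None')

# Total number of NAs left in the frame.
print(df.isna().sum().sum())
```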


Split the dataset into train and test sets, using the default 20% test split.
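No split cell is shown in this notebook. A minimal sketch of a 20% hold-out in plain pandas could look like the following, assuming the data is available as a DataFrame (as `houses.data` appears to be). The toy frame and the `random_state` value are illustrative choices only.

```python
import pandas as pd

# Toy frame; in the notebook this would be houses.data (assumed DataFrame).
df = pd.DataFrame({'Id': range(1, 11), 'SalePrice': range(100, 1100, 100)})

# Hold out 20% of the rows for testing; random_state makes the draw repeatable.
test = df.sample(frac=0.2, random_state=42)

# Everything that was not sampled stays in the training set.
train = df.drop(test.index)

print(len(train), len(test))
```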

## Build the EntitySet

In [3]:
import featuretools as ft

es = ft.EntitySet()
es = es.entity_from_dataframe(entity_id='houses',
                              dataframe=houses.data,
                              index='Id')
es

Entityset: None
  Entities:
    houses [Rows: 1460, Columns: 80]
  Relationships:
    No relationships

### Normalize the entity

In [7]:
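# NOTE: list.remove() returns None, so if names('all') returns a list,
# additional_variables is effectively None in this call.
es.normalize_entity(base_entity_id='houses',
                    new_entity_id='houses_norm',
                    additional_variables=houses.names('all').remove('Id'),
                    index='Id')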
es

Entityset: None
  Entities:
    houses [Rows: 1460, Columns: 80]
    houses_norm [Rows: 1460, Columns: 1]
  Relationships:
    houses.Id -> houses_norm.Id
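In the `normalize_entity` call above, the `additional_variables` argument is built with `list.remove`, which mutates the list and returns `None`. Assuming `houses.names('all')` returns a plain Python list (an assumption, since `Dataset` is a project-local helper), the argument ends up as `None`, which would be consistent with `houses_norm` holding a single column. The snippet below illustrates the behaviour with hypothetical column names.

```python
# Hypothetical column names standing in for houses.names('all').
names = ['Id', 'LotArea', 'SalePrice']

# list.remove() edits the list in place and returns None.
result = names.copy().remove('Id')
print(result)

# A comprehension builds the intended list without mutating the original.
without_id = [name for name in names if name != 'Id']
print(without_id)
```

According to the featuretools documentation, variables listed in `additional_variables` are moved from the base entity to the new one, so passing the full list would change the entity sizes shown above.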

### Deep feature synthesis

In [8]:
f_matrix, f_defs = ft.dfs(entityset=es,
                          target_entity='houses_norm', verbose=1)

Built 302 features
Elapsed: 00:20 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 10/10 chunks
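With `houses_norm` as the target, the generated features come from aggregating rows of `houses` across the `houses.Id -> houses_norm.Id` relationship. The following is a rough pandas analogue of that idea, not featuretools' actual implementation. The toy values are made up, and the column names only imitate the `MEAN(houses.LotArea)` naming style.

```python
import pandas as pd

# Toy child table standing in for the 'houses' entity (values are made up).
houses_df = pd.DataFrame({
    'Id': [1, 2, 3],
    'LotArea': [8450, 9600, 11250],
    'GrLivArea': [1710, 1262, 1786],
})

# Aggregate every numeric column per parent key, similar in spirit to
# applying aggregation primitives across the relationship on 'Id'.
agg = houses_df.groupby('Id').agg(['sum', 'mean', 'max'])

# Flatten the (column, function) MultiIndex into featuretools-like names.
agg.columns = [f'{func.upper()}(houses.{col})' for col, func in agg.columns]

print(agg.columns.tolist())
```

Because each `Id` matches exactly one row in `houses`, aggregations such as the mean or the maximum simply reproduce the original value, which is why single-table synthesis of this kind adds limited new information.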


Remove the newly generated features that are derived from the target, keeping `SalePrice` itself.

In [10]:
# Collect generated features derived from the target, but keep the target itself.
drop_cols = [col for col in f_matrix
             if col != 'SalePrice' and 'SalePrice' in col]

print('Need to drop columns:', drop_cols)
f_matrix = f_matrix.drop(columns=drop_cols)

Need to drop columns: []


In [11]:
f_matrix.shape

(1460, 296)