# Featuretools 

In this notebook, we will implement automated feature engineering with [Featuretools](https://docs.featuretools.com/#minute-quick-start) for the Costa Rican Household Poverty Challenge. The objective of this data-science-for-good problem is to predict the poverty level of households in Costa Rica.

## Automated Feature Engineering

Automated feature engineering should be a _default_ part of your data science workflow. Manual feature engineering is limited by both human creativity and time, while automated methods are far less constrained by either. At the moment, Featuretools is the only open-source Python library available for automated feature engineering. This library is extremely easy to get started with and very powerful (as the score from this kernel illustrates).

For anyone new to Featuretools, check out the [documentation](https://docs.featuretools.com/getting_started/install.html) or this [introductory blog post](https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219).

In [None]:
import numpy as np 
import pandas as pd

import featuretools as ft

import warnings
warnings.filterwarnings('ignore')

We'll read in the data and join the training and testing sets together.

In [None]:
# Raw data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
test['Target'] = np.nan

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([train, test], sort = True)

In [None]:
train_valid = train.loc[train['parentesco1'] == 1, ['idhogar', 'Id', 'Target']].copy()
test_valid = test.loc[test['parentesco1'] == 1, ['idhogar', 'Id']].copy()

submission_base = test[['Id', 'idhogar']]

### Data Preprocessing 

These steps are laid out in the kernel [A Complete Introduction and Walkthrough](https://www.kaggle.com/willkoehrsen/a-complete-introduction-and-walkthrough). They involve filling in missing values and creating a few features that Featuretools can build on.

In [None]:
mapping = {"yes": 1, "no": 0}

# Fill in the values with the correct mapping
data['dependency'] = data['dependency'].replace(mapping).astype(np.float64)
data['edjefa'] = data['edjefa'].replace(mapping).astype(np.float64)
data['edjefe'] = data['edjefe'].replace(mapping).astype(np.float64)

data[['dependency', 'edjefa', 'edjefe']].describe()

## Missing Values

In [None]:
data['v18q1'] = data['v18q1'].fillna(0)

# Fill in households that own the house with 0 rent payment
data.loc[(data['tipovivi1'] == 1), 'v2a1'] = 0

# Create missing rent payment column
data['v2a1-missing'] = data['v2a1'].isnull()

# If individual is over 19 or younger than 7 and missing years behind, set it to 0
data.loc[((data['age'] > 19) | (data['age'] < 7)) & (data['rez_esc'].isnull()), 'rez_esc'] = 0

# Add a flag for those between 7 and 19 with a missing value
data['rez_esc-missing'] = data['rez_esc'].isnull()

data.loc[data['rez_esc'] > 5, 'rez_esc'] = 5

## Domain Knowledge Feature Construction

In [None]:
# Difference between people living in house and household size
data['hhsize-diff'] = data['tamviv'] - data['hhsize']

elec = []

# Assign values
for i, row in data.iterrows():
    if row['noelec'] == 1:
        elec.append(0)
    elif row['coopele'] == 1:
        elec.append(1)
    elif row['public'] == 1:
        elec.append(2)
    elif row['planpri'] == 1:
        elec.append(3)
    else:
        elec.append(np.nan)
        
# Record the new variable and missing flag
data['elec'] = elec
data['elec-missing'] = data['elec'].isnull()

# Remove the electricity columns
# data = data.drop(columns = ['noelec', 'coopele', 'public', 'planpri'])

# Wall ordinal variable
data['walls'] = np.argmax(np.array(data[['epared1', 'epared2', 'epared3']]),
                           axis = 1)

# data = data.drop(columns = ['epared1', 'epared2', 'epared3'])

# Roof ordinal variable
data['roof'] = np.argmax(np.array(data[['etecho1', 'etecho2', 'etecho3']]),
                           axis = 1)
# data = data.drop(columns = ['etecho1', 'etecho2', 'etecho3'])

# Floor ordinal variable
data['floor'] = np.argmax(np.array(data[['eviv1', 'eviv2', 'eviv3']]),
                           axis = 1)
# data = data.drop(columns = ['eviv1', 'eviv2', 'eviv3'])

# Create new feature
data['walls+roof+floor'] = data['walls'] + data['roof'] + data['floor']

# No toilet, no electricity, no floor, no water service, no ceiling
data['warning'] = 1 * (data['sanitario1'] + 
                         (data['elec'] == 0) + 
                         data['pisonotiene'] + 
                         data['abastaguano'] + 
                         (data['cielorazo'] == 0))

# Owns a refrigerator, computer, tablet, and television
data['bonus'] = 1 * (data['refrig'] + 
                      data['computer'] + 
                      (data['v18q1'] > 0) + 
                      data['television'])

# Per capita features
data['phones-per-capita'] = data['qmobilephone'] / data['tamviv']
data['tablets-per-capita'] = data['v18q1'] / data['tamviv']
data['rooms-per-capita'] = data['rooms'] / data['tamviv']
data['rent-per-capita'] = data['v2a1'] / data['tamviv']

# Create one feature from the `instlevel` columns
data['inst'] = np.argmax(np.array(data[[c for c in data if c.startswith('instl')]]), axis = 1)
# data = data.drop(columns = [c for c in data if c.startswith('instlevel')])

data['escolari/age'] = data['escolari'] / data['age']
data['inst/age'] = data['inst'] / data['age']
data['tech'] = data['v18q'] + data['mobilephone']

print('Data shape: ', data.shape)
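The ordinal variables above (`walls`, `roof`, `floor`, `inst`) all rely on the same trick: `np.argmax` over a group of one-hot columns returns the position of the 1 in each row. Here is a minimal sketch on a toy frame; the toy values are made up, and the worst-to-best ordering of the columns is an assumption about how the dataset encodes quality.

```python
import numpy as np
import pandas as pd

# Toy one-hot columns named after the wall-quality columns in the data.
# Assumption: the columns are ordered from worst (epared1) to best (epared3).
toy = pd.DataFrame({
    'epared1': [1, 0, 0],
    'epared2': [0, 1, 0],
    'epared3': [0, 0, 1],
})

# argmax returns the column position of the largest value in each row,
# which for one-hot rows is the position of the single 1
toy['walls'] = np.argmax(toy[['epared1', 'epared2', 'epared3']].to_numpy(), axis=1)
print(toy['walls'].tolist())
```

Note that a row containing only zeros would also map to 0, because `argmax` returns the first position when all values tie.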

### Remove Squared Variables

The gradient boosting machine does not need the squared version of variables if it already has the original variables.

In [None]:
data = data[[x for x in data if not x.startswith('SQB')]]
data = data.drop(columns = ['agesq'])
data.shape
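One way to see why the squared columns add nothing for a tree-based model: trees split on thresholds, and squaring a non-negative variable preserves the ordering of its values, so every split on the squared column has an equivalent split on the original. A small sketch with made-up ages:

```python
import numpy as np

# Made-up ages and their squares (the toy analogue of `age` and `agesq`)
age = np.array([3, 18, 25, 40, 67])
age_sq = age ** 2

# The split "age <= 25" partitions the rows exactly like "age_sq <= 625"
split_on_age = age <= 25
split_on_age_sq = age_sq <= 25 ** 2
print(np.array_equal(split_on_age, split_on_age_sq))
```

This argument relies on the variable being non-negative; for a variable that takes both signs, squaring is no longer monotonic.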

## Remove Highly Correlated Columns

In [None]:
# Create correlation matrix
corr_matrix = data.corr()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with an absolute correlation greater than 0.975
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.975)]

print(f'There are {len(to_drop)} correlated columns to remove.')
print(to_drop)

In [None]:
data = data.drop(columns = to_drop)
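To make the upper-triangle step concrete, here is the same procedure on a toy frame with made-up values. Keeping only the entries strictly above the diagonal means each pair of columns is examined once, so only the later column of a highly correlated pair is flagged.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with 'a'
    'c': [4.0, 1.0, 3.0, 2.0],   # only weakly correlated with 'a' and 'b'
})

toy_corr = toy.corr().abs()

# k=1 excludes the diagonal, so a column is never compared with itself
toy_mask = np.triu(np.ones(toy_corr.shape), k=1).astype(bool)
toy_upper = toy_corr.where(toy_mask)

# A column is flagged if it is highly correlated with any column to its left
toy_to_drop = [col for col in toy_upper.columns if (toy_upper[col] > 0.975).any()]
print(toy_to_drop)
```

Only `b` is flagged; `a` is kept because its column in the upper triangle contains no comparisons.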

# Establish Correct Variable Types

We need to specify the correct variable types:

1. Individual Variables: these are characteristics of each individual rather than the household
    * Boolean: Yes or No (0 or 1)
    * Ordered Discrete: Integers with an ordering
2. Household variables
    * Boolean: Yes or No
    * Ordered Discrete: Integers with an ordering
    * Continuous numeric

Below we manually define the variables in each category. This is a little tedious, but also necessary.

In [None]:
import featuretools.variable_types as vtypes

In [None]:
hh_bool = ['hacdor', 'hacapo', 'v14a', 'refrig', 'paredblolad', 'paredzocalo', 
           'paredpreb','pisocemento', 'pareddes', 'paredmad',
           'paredzinc', 'paredfibras', 'paredother', 'pisomoscer', 'pisoother', 
           'pisonatur', 'pisonotiene', 'pisomadera',
           'techozinc', 'techoentrepiso', 'techocane', 'techootro', 'cielorazo', 
           'abastaguadentro', 'abastaguafuera', 'abastaguano',
            'public', 'planpri', 'noelec', 'coopele', 'sanitario1', 
           'sanitario2', 'sanitario3', 'sanitario5',   'sanitario6',
           'energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4', 
           'elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 
           'elimbasu5', 'elimbasu6', 'epared1', 'epared2', 'epared3',
           'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 
           'tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5', 
           'computer', 'television', 'lugar1', 'lugar2', 'lugar3',
           'lugar4', 'lugar5', 'lugar6', 'area1', 'area2', 'v2a1-missing', 'elec-missing']

hh_ordered = [ 'rooms', 'r4h1', 'r4h2', 'r4h3', 'r4m1','r4m2','r4m3', 'r4t1',  'r4t2', 
              'r4t3', 'v18q1', 'tamhog','tamviv','hhsize','hogar_nin','hhsize-diff',
              'elec',  'walls', 'roof', 'floor', 'walls+roof+floor', 'warning', 'bonus',
              'hogar_adul','hogar_mayor','hogar_total',  'bedrooms', 'qmobilephone']

hh_cont = ['v2a1', 'dependency', 'edjefe', 'edjefa', 'meaneduc', 'overcrowding',
          'phones-per-capita', 'tablets-per-capita', 'rooms-per-capita', 'rent-per-capita']

In [None]:
ind_bool = ['v18q', 'dis', 'male', 'female', 'estadocivil1', 'estadocivil2', 'estadocivil3', 
            'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
            'parentesco1', 'parentesco2',  'parentesco3', 'parentesco4', 'parentesco5', 
            'parentesco6', 'parentesco7', 'parentesco8',  'parentesco9', 'parentesco10', 
            'parentesco11', 'parentesco12', 'instlevel1', 'instlevel2', 'instlevel3', 
            'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 
            'instlevel9', 'mobilephone', 'rez_esc-missing']

ind_ordered = ['age', 'escolari', 'rez_esc', 'inst', 'tech']

ind_cont = ['escolari/age', 'inst/age']

The cells below remove from these lists any columns that are no longer in the data (these may have been removed due to correlation).

In [None]:
to_remove = []
for l in [hh_ordered, hh_bool, hh_cont, ind_bool, ind_ordered, ind_cont]:
    for c in l:
        if c not in data:
            to_remove.append(c)

In [None]:
for l in [hh_ordered, hh_bool, hh_cont, ind_bool, ind_ordered, ind_cont]:
    for c in to_remove:
        if c in l:
            l.remove(c)

The three columns not in the above lists are `Id`, `idhogar`, and `Target`.

In [None]:
len(hh_ordered+hh_bool+hh_cont+ind_bool+ind_ordered+ind_cont) == (data.shape[1] - 3)

Below we convert the `Boolean` variables to the correct type. 

In [None]:
for variable in (hh_bool + ind_bool):
    data[variable] = data[variable].astype('bool')

Then we convert the float variables.

In [None]:
for variable in (hh_cont + ind_cont):
    data[variable] = data[variable].astype(float)

Finally, the same with the ordinal variables.

In [None]:
for variable in (hh_ordered + ind_ordered):
    try:
        data[variable] = data[variable].astype(int)
    except Exception as e:
        print(f'Could not convert {variable} because of missing values.')
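The conversion can fail because NumPy integers have no way to represent a missing value, so pandas refuses to cast a column containing `NaN` to `int`. A minimal sketch with a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

try:
    s.astype(int)
    converted = True
except (ValueError, TypeError):
    # pandas raises here because NaN cannot be stored in an integer column
    converted = False

print(converted)
```

Columns that fail the conversion simply keep their float type, which is why the loop above only prints a message instead of stopping.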

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data.dtypes.value_counts().plot.bar(edgecolor = 'k');
plt.title('Variable Type Distribution');

# EntitySet and Entities

An `EntitySet` in Featuretools holds all of the tables and the relationships between them. At the moment we only have a single table, but we can create multiple tables through normalization. We'll call the first table `data` since it contains all the information both at the individual level and at the household level.

In [None]:
es = ft.EntitySet(id = 'households')
es.entity_from_dataframe(entity_id = 'data', 
                         dataframe = data, 
                         index = 'Id')

# Normalize Household Table

Normalization allows us to create another table with one unique row per instance. In this case, the instances are households. The new table is derived from the `data` table and we need to bring along any of the household level variables. Since these are the same for all members of a household, we can directly add these as columns in the household table using `additional_variables`. The index of the household table is `idhogar` which uniquely identifies each household.  

All of the variable types have already been confirmed.

In [None]:
es.normalize_entity(base_entity_id='data', 
                    new_entity_id='household', 
                    index = 'idhogar', 
                    additional_variables = hh_bool + hh_ordered + hh_cont + ['Target'])
es
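Conceptually, normalization splits one wide table into a parent table with one row per household and a child table that keeps only the individual-level columns plus the linking key. The following is a rough pandas sketch of that idea on made-up data; it illustrates the result, not how Featuretools implements `normalize_entity` internally.

```python
import pandas as pd

# Toy individual-level table: one row per person, household columns repeated
people = pd.DataFrame({
    'Id': ['p1', 'p2', 'p3'],
    'idhogar': ['h1', 'h1', 'h2'],
    'age': [34, 8, 51],
    'rooms': [3, 3, 5],   # household-level: identical within a household
})

household_cols = ['rooms']

# Parent table: one unique row per household with the household-level columns
household = (people[['idhogar'] + household_cols]
             .drop_duplicates('idhogar')
             .set_index('idhogar'))

# Child table: individual-level columns plus the idhogar key linking to the parent
individuals = people.drop(columns=household_cols)

print(household)
print(individuals)
```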

### Table Relationships

Normalizing the entity automatically adds in the relationship between the parent, `household`, and the child, `data`. This relationship links the two tables and allows us to create "deep features" by aggregating individuals in each household.

# Deep Feature Synthesis

Here is where Featuretools gets to work. Using feature primitives, Deep Feature Synthesis can build hundreds (or thousands, as we will see later) of features from the relationships between tables and the columns in the tables themselves. There are two types of primitives, which are operations applied to data:

* Transforms: applied to one or more columns in a _single table_ of data 
* Aggregations: applied across _multiple tables_ using the relationships between tables

We generate the features by calling `ft.dfs`. This builds features using any of the applicable primitives for each column in the data. Featuretools uses the table relationships to aggregate features as required. For example, it will automatically aggregate the individual-level data at the household level.
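The difference between the two kinds of primitives can be illustrated with plain pandas on made-up data. This is only a sketch of the idea: the column names in the aggregated frame merely imitate the style of Featuretools feature names and are not produced by the library here.

```python
import pandas as pd

people = pd.DataFrame({
    'Id': ['p1', 'p2', 'p3'],
    'idhogar': ['h1', 'h1', 'h2'],
    'age': [34, 8, 51],
    'escolari': [11, 2, 6],
})

# Transform: combines columns within a single table, row by row
people['escolari/age'] = people['escolari'] / people['age']

# Aggregation: summarizes the child rows (individuals) of each parent (household)
household_features = people.groupby('idhogar')['age'].agg(['mean', 'max'])
household_features.columns = ['MEAN(data.age)', 'MAX(data.age)']

print(household_features)
```

Deep Feature Synthesis stacks such operations, so a household-level feature can be an aggregation of a transform of individual-level columns.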

To start with, we use the default `agg` and `trans` primitives in a call to `ft.dfs`.

In [None]:
# Deep Feature Synthesis
feature_matrix, feature_names = ft.dfs(entityset=es, 
                                       target_entity = 'household', 
                                       max_depth = 2, 
                                       verbose = 1, 
                                       n_jobs = -1, 
                                       chunk_size = 100)


In [None]:
all_features = [str(x.get_name()) for x in feature_names]
feature_matrix.head()

In [None]:
all_features[-10:]

We need to remove any columns containing derivations of the `Target`. These are created because some of the transform primitives may have been applied to the `Target`.

In [None]:
drop_cols = []
for col in feature_matrix:
    if col == 'Target':
        pass
    else:
        if 'Target' in col:
            drop_cols.append(col)
            
print(drop_cols)            
feature_matrix = feature_matrix[[x for x in feature_matrix if x not in drop_cols]]         
feature_matrix.head()

Most of these features are aggregations we could have made ourselves. However, why go to the trouble if Featuretools can do that for us?

In [None]:
feature_matrix.shape

That one call alone gave us 147 features to train a model! This was only using the default primitives as well. We can use more primitives or write our own to build more features.

# Feature Selection

We can do some rudimentary feature selection by removing one column from each pair whose absolute correlation is at least 0.99.

In [None]:
# Create correlation matrix
corr_matrix = feature_matrix.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find feature columns with a correlation of at least 0.99
to_drop = [column for column in upper.columns if any(upper[column] >= 0.99)]

print('There are {} columns with >= 0.99 correlation.'.format(len(to_drop)))
to_drop

In [None]:
feature_matrix = feature_matrix[[x for x in feature_matrix if x not in to_drop]]

In [None]:
feature_matrix.shape