# Loan Prediction Challenge: Feature Engineering

In this notebook, we will carry out feature engineering on the datasets and create a feature matrix for training and testing.

## Table of Contents
* **[Preprocessing](#preprocessing)**
  * [Join Datasets](#join-datasets)
  * [Entities and Entitysets](#entities-and-entitysets)
    * [Adding Entities](#adding-entities)
  * [Relationships](#relationships)
    * [Adding Relationships](#adding-relationships)
    * [Visualise `EntitySet`](#visualise-entityset)
  * [Feature Primitives](#feature-primitives)
* **[Deep Feature Synthesis (DFS)](#deep-feature-synthesis)**
  * [Selecting Primitives](#selecting-primitives)
  * [Run Full Deep Feature Synthesis](#run-full-deep-feature-synthesis)
  * [Save](#save)
* **[References](#references)**

In [1]:
# Header
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

In [2]:
# The usual suspects ...
import sys
import psutil
import numpy as np
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

from scipy import stats
from sklearn import preprocessing

In [3]:
# Reading in the data
# Training datasets:
traindemographics = pd.read_csv('../data/traindemographics.csv')
trainperf = pd.read_csv('../data/trainperf.csv')
trainprevloans = pd.read_csv('../data/trainprevloans.csv')

# Testing datasets:
testdemographics = pd.read_csv('../data/testdemographics.csv')
testperf = pd.read_csv('../data/testperf.csv')
testprevloans = pd.read_csv('../data/testprevloans.csv')

<a id='preprocessing'></a>
## Preprocessing

<a id='join-datasets'></a>
### Join Datasets

In [4]:
# Create target variable for test dataset
testperf['good_bad_flag'] = np.nan

# Join train and test datasets
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
demographics = pd.concat([traindemographics, testdemographics], ignore_index=True, sort=True)
performance = pd.concat([trainperf, testperf], ignore_index=True, sort=True)
prevloans = pd.concat([trainprevloans, testprevloans], ignore_index=True, sort=True)

In [5]:
demographics.columns, performance.columns, prevloans.columns

(Index(['bank_account_type', 'bank_branch_clients', 'bank_name_clients',
        'birthdate', 'customerid', 'employment_status_clients', 'latitude_gps',
        'level_of_education_clients', 'longitude_gps'],
       dtype='object'),
 Index(['approveddate', 'creationdate', 'customerid', 'good_bad_flag',
        'loanamount', 'loannumber', 'referredby', 'systemloanid', 'termdays',
        'totaldue'],
       dtype='object'),
 Index(['approveddate', 'closeddate', 'creationdate', 'customerid',
        'firstduedate', 'firstrepaiddate', 'loanamount', 'loannumber',
        'referredby', 'systemloanid', 'termdays', 'totaldue'],
       dtype='object'))

After combining the datasets, the performance and prevloans datasets each have `systemloanid` as a unique identifier. The demographics dataset, on the other hand, has no unique identifier: `customerid` looks like a candidate for an index, but it contains duplicate entries. We'll need to create a unique index for this dataset and give it a name when we create the entities.
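
To make the uniqueness check concrete, here is a minimal pure-Python sketch. The customer ids are made up for illustration, and the row-counter key is just one common way to build a surrogate index; it is an assumption of this sketch, not a statement about how Featuretools builds `demographicid` internally.

```python
from collections import Counter

# Hypothetical customer ids; the second id appears twice, as duplicates do
# in the real demographics table.
customer_ids = ["c001", "c002", "c002", "c003"]

# A column can serve as an index only if no value occurs more than once.
counts = Counter(customer_ids)
duplicated_ids = sorted(cid for cid, n in counts.items() if n > 1)
is_unique = not duplicated_ids

# Fallback: a surrogate key that simply numbers the rows 0, 1, 2, ...
demographic_ids = list(range(len(customer_ids)))

print(is_unique)        # False
print(duplicated_ids)   # ['c002']
print(demographic_ids)  # [0, 1, 2, 3]
```

On the real data, the same question can be asked of any candidate column before choosing it as an entity index.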

<a id='entities-and-entitysets'></a>
### Entities and Entitysets

In [6]:
# Create entityset
entityset = ft.EntitySet(id='customers')

<a id='adding-entities'></a>
#### Adding Entities

In [8]:
# Entities with a unique index
entityset = entityset.entity_from_dataframe(entity_id='loan_performance',
                                            dataframe=performance,
                                            index='customerid')
entityset = entityset.entity_from_dataframe(entity_id='previous_loans',
                                            dataframe=prevloans,
                                            index='systemloanid')

# Entities with no unique index
entityset = entityset.entity_from_dataframe(entity_id='customer_demographics',
                                            dataframe=demographics,
                                            make_index=True,
                                            index='demographicid')

# Show entityset
entityset

Entityset: customers
  Entities:
    loan_performance [Rows: 5818, Columns: 10]
    previous_loans [Rows: 24090, Columns: 12]
    customer_demographics [Rows: 5833, Columns: 10]
  Relationships:
    No relationships

<a id='relationships'></a>
### Relationships

The performance dataset acts as the parent table: both `customerid` and `systemloanid` are unique in it. The demographics and prevloans datasets are its child tables, because performance has exactly one row per customer while demographics and prevloans can have several rows for the same `customerid`.
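
The parent/child reasoning can be checked mechanically. The sketch below uses made-up miniature tables and a hypothetical helper, `is_valid_parent_child`. Requiring every child key to appear in the parent is a stricter assumption of this sketch, not necessarily something Featuretools enforces.

```python
# Hypothetical miniature tables, each row reduced to its key columns.
performance = [{"customerid": "c001"}, {"customerid": "c002"}]
prevloans = [
    {"systemloanid": 1, "customerid": "c001"},
    {"systemloanid": 2, "customerid": "c001"},
    {"systemloanid": 3, "customerid": "c002"},
]


def is_valid_parent_child(parent_rows, child_rows, key):
    """Return True if `key` is unique in the parent and every child
    key value also appears in the parent."""
    parent_keys = [row[key] for row in parent_rows]
    key_set = set(parent_keys)
    unique_in_parent = len(parent_keys) == len(key_set)
    children_covered = all(row[key] in key_set for row in child_rows)
    return unique_in_parent and children_covered


print(is_valid_parent_child(performance, prevloans, "customerid"))  # True
print(is_valid_parent_child(prevloans, performance, "customerid"))  # False
```

Swapping the roles fails because `c001` occurs twice in the would-be parent, which is why prevloans cannot be the parent on `customerid`.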

<a id='adding-relationships'></a>
#### Adding Relationships

For each relationship, we specify the parent variable first and then the child variable. An `EntitySet` that tracks these relationships lets us work at a higher level of abstraction: we can reason about the dataset as a whole instead of joining individual tables by hand for every feature.

In [9]:
# Relationship between performance and demographics - `customerid`
r_perf_demo = ft.Relationship(entityset['loan_performance']['customerid'], 
                              entityset['customer_demographics']['customerid'])

# Relationship between performance and previous loans - `customerid`
r_perf_prev = ft.Relationship(entityset['loan_performance']['customerid'],
                              entityset['previous_loans']['customerid'])

In [10]:
# Add the defined relationships
entityset = entityset.add_relationships([r_perf_demo, r_perf_prev])

# Show entityset
entityset

Entityset: customers
  Entities:
    loan_performance [Rows: 5818, Columns: 10]
    previous_loans [Rows: 24090, Columns: 12]
    customer_demographics [Rows: 5833, Columns: 10]
  Relationships:
    customer_demographics.customerid -> loan_performance.customerid
    previous_loans.customerid -> loan_performance.customerid

**Note:**

We need to be careful not to create a [diamond graph](https://en.wikipedia.org/wiki/Diamond_graph) where there are multiple paths from a parent to a child - this results in ambiguity.
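
One way to spot such ambiguity is to count the directed paths between two entities; more than one path means a diamond. This is a minimal sketch in plain Python, not a Featuretools feature. The `payments` table is hypothetical and exists only to create a diamond.

```python
def count_paths(edges, start, end):
    """Count distinct directed paths from `start` to `end` in an acyclic
    graph, where `edges` is a list of (child, parent) pairs."""
    if start == end:
        return 1
    return sum(count_paths(edges, parent, end)
               for child, parent in edges if child == start)


# Relationships from this notebook: each child points at loan_performance.
notebook_edges = [
    ("customer_demographics", "loan_performance"),
    ("previous_loans", "loan_performance"),
]

# Hypothetical extra table `payments` linked to both child tables: a diamond.
diamond_edges = notebook_edges + [
    ("payments", "customer_demographics"),
    ("payments", "previous_loans"),
]

print(count_paths(notebook_edges, "previous_loans", "loan_performance"))  # 1
print(count_paths(diamond_edges, "payments", "loan_performance"))         # 2
```

The two relationships defined in this notebook give exactly one path from each child to the parent, so no ambiguity arises.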


All entities in the entityset can be linked through these relationships. In theory, we could calculate features for any of the entities. In practice, we will only calculate features for the parent dataframe, `loan_performance`, which is used for training and testing. The end result will be a dataframe with one row for each client in the parent and many generated features per client.

<a id='visualise-entityset'></a>
#### Visualise `EntitySet`

<a id='feature-primitives'></a>
### Feature Primitives

A [feature primitive](https://docs.featuretools.com/automated_feature_engineering/primitives.html) is an operation applied to a table or a set of tables to create a feature. These represent simple calculations, most of which are already used in manual feature engineering, that can be stacked on top of each other to create complex deep features. Feature primitives fall into two categories:

- **Aggregation**: a function that groups together children for each parent and calculates a statistic such as the mean, min, max, or standard deviation across the children. An example is the maximum previous loan amount for each client. An aggregation covers multiple tables using relationships between tables.

- **Transformation**: an operation applied to one or more columns in a single table. An example would be taking the absolute value of a column, or finding the difference between two columns in one table.
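
The two categories can be illustrated without Featuretools. The sketch below uses made-up loan rows with columns named after `customerid`, `loanamount`, and `totaldue`. It shows the idea behind each kind of primitive, not the library's implementation.

```python
from collections import defaultdict

# Hypothetical previous loans: (customerid, loanamount, totaldue).
prev_loans = [
    ("c001", 10000.0, 11500.0),
    ("c001", 20000.0, 23000.0),
    ("c002", 10000.0, 13000.0),
]

# Transformation: combine columns within one table, row by row.
interest = [totaldue - loanamount for _, loanamount, totaldue in prev_loans]

# Aggregation: group child rows by parent key, then reduce each group.
amounts_by_customer = defaultdict(list)
for customerid, loanamount, _ in prev_loans:
    amounts_by_customer[customerid].append(loanamount)
max_loanamount = {cid: max(vals) for cid, vals in amounts_by_customer.items()}

print(interest)        # [1500.0, 3000.0, 3000.0]
print(max_loanamount)  # {'c001': 20000.0, 'c002': 10000.0}
```

The transformation yields one value per loan row, while the aggregation yields one value per customer, which is the shape needed in the parent table.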

In [11]:
# Listing the primitives in a dataframe
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100

primitives[primitives['type'] == 'aggregation'].head(10)

|    | name | type | description |
|---:|:-----|:-----|:------------|
| 0 | all | aggregation | Calculates if all values are 'True' in a list. |
| 1 | percent_true | aggregation | Determines the percent of `True` values. |
| 2 | last | aggregation | Determines the last value in a list. |
| 3 | num_unique | aggregation | Determines the number of distinct values, ignoring `NaN` values. |
| 4 | skew | aggregation | Computes the extent to which a distribution differs from a normal distribution. |
| 5 | min | aggregation | Calculates the smallest value, ignoring `NaN` values. |
| 6 | avg_time_between | aggregation | Computes the average number of seconds between consecutive events. |
| 7 | mean | aggregation | Computes the average for a list of values. |
| 8 | trend | aggregation | Calculates the trend of a variable over time. |
| 9 | count | aggregation | Determines the total number of values, excluding `NaN`. |


In [12]:
primitives[primitives['type'] == 'transform'].head(10)

|    | name | type | description |
|---:|:-----|:-----|:------------|
| 20 | cum_mean | transform | Calculates the cumulative mean. |
| 21 | modulo_by_feature | transform | Return the modulo of a scalar by each element in the list. |
| 22 | subtract_numeric_scalar | transform | Subtract a scalar from each element in the list. |
| 23 | less_than_equal_to_scalar | transform | Determines if values are less than or equal to a given scalar. |
| 24 | cum_sum | transform | Calculates the cumulative sum. |
| 25 | divide_by_feature | transform | Divide a scalar by each value in the list. |
| 26 | modulo_numeric | transform | Element-wise modulo of two lists. |
| 27 | less_than | transform | Determines if values in one list are less than another list. |
| 28 | less_than_scalar | transform | Determines if values are less than a given scalar. |
| 29 | minute | transform | Determines the minutes value of a datetime. |


<a id='deep-feature-synthesis'></a>
## Deep Feature Synthesis (DFS)

[Deep Feature Synthesis (DFS)](https://www.featuretools.com/blog/deep-feature-synthesis) is the method Featuretools uses to make new features. DFS stacks feature primitives to form features with a "depth" equal to the number of primitives.

**Example**
- The maximum value of a client's previous loans, `MAX(previous_loans.loanamount)`, is a "deep feature" with a depth of 1.
- To create a feature of depth 2, we would stack primitives by taking the maximum value of a client's average payment per previous loan, `MAX(previous_loans.MEAN(installments.payment))`. This assumes a hypothetical `installments` table of payments for each previous loan, which this dataset does not include. In manual feature engineering, this would require two separate groupings and aggregations, and take more than 15 minutes to code per feature.
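
The depth-2 example can be worked by hand to show what "stacking" means. The sketch below uses made-up rows for a hypothetical installments table, since this dataset has none, and plain Python in place of Featuretools.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical installments: (customerid, previous loan id, payment).
installments = [
    ("c001", "loan_a", 100.0),
    ("c001", "loan_a", 300.0),
    ("c001", "loan_b", 500.0),
    ("c002", "loan_c", 250.0),
]

# Depth 1: MEAN(installments.payment) for each previous loan.
payments_by_loan = defaultdict(list)
loan_owner = {}
for customerid, loanid, payment in installments:
    payments_by_loan[loanid].append(payment)
    loan_owner[loanid] = customerid
mean_payment = {loanid: mean(p) for loanid, p in payments_by_loan.items()}

# Depth 2: MAX of those per-loan means for each customer.
means_by_customer = defaultdict(list)
for loanid, value in mean_payment.items():
    means_by_customer[loan_owner[loanid]].append(value)
max_mean_payment = {cid: max(v) for cid, v in means_by_customer.items()}

print(max_mean_payment)  # {'c001': 500.0, 'c002': 250.0}
```

Each level is one grouping followed by one aggregation, so a depth-2 feature costs two passes when written manually.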

**Advantages of DFS**
- Allows us to overcome human limitations of time and creativity by building features that we would never be able to think of on our own, or would not have the patience to implement.
- DFS is applicable to any dataset with only very minor changes in syntax.

In [13]:
# Default primitives from featuretools
default_agg_primitives = ['sum', 'std', 'max', 'skew', 'min', 'mean', 'count', 'percent_true', 'num_unique', 'mode']
default_trans_primitives = ['day', 'year', 'month', 'weekday', 'haversine', 'num_words', 'num_characters']

# DFS with specified primitives
feature_names = ft.dfs(entityset=entityset,
                       target_entity='loan_performance',
                       trans_primitives=default_trans_primitives,
                       agg_primitives=default_agg_primitives,
                       where_primitives=[],
                       seed_features=[],
                       max_depth=2,
                       n_jobs=-1,
                       verbose=1,
                       features_only=True)

Built 107 features


<a id='selecting-primitives'></a>
### Selecting Primitives

For our actual set of features, we will use a select group of primitives rather than just the defaults.

In [14]:
# Specify primitives
agg_primitives = ['sum', 'max', 'min', 'mean', 'count', 'percent_true', 'num_unique', 'mode', 'median']
trans_primitives = ['year', 'month', 'day', 'percentile', 'and']

In [15]:
# DFS
feature_names = ft.dfs(entityset=entityset,
                       target_entity='loan_performance',
                       agg_primitives=agg_primitives,
                       trans_primitives=trans_primitives,
                       n_jobs=-1,
                       verbose=1,
                       features_only=True,
                       max_depth=2)

Built 162 features


In [16]:
# Save the features. We'll want to use them later on a separate dataset
ft.save_features(feature_names, '../data/features.txt')

<a id='run-full-deep-feature-synthesis'></a>
### Run Full Deep Feature Synthesis

If we are content with the features that have been built, we can run DFS and create the feature matrix.

In [17]:
print('Total size of entityset: {:.2f} GB'.format(sys.getsizeof(entityset) / 1e9))
print('Total number of CPUs detected: {}'.format(psutil.cpu_count()))
print('Total size of system memory: {:.2f} GB'.format(psutil.virtual_memory().total / 1e9))

Total size of entityset: 0.01 GB
Total number of CPUs detected: 4
Total size of system memory: 7.76 GB


In [None]:
# Run DFS
feature_matrix, feature_names = ft.dfs(entityset=entityset,
                                       target_entity='loan_performance',
                                       agg_primitives=agg_primitives,
                                       trans_primitives=trans_primitives,
                                       n_jobs=1,
                                       verbose=1,
                                       features_only=False,
                                       max_depth=2,
                                       chunk_size=100)

Built 162 features
Elapsed: 03:45 | Progress:  58%|█████▊    | Remaining: 03:11

<a id='save'></a>
### Save

In [None]:
# Reformat
feature_matrix.reset_index(inplace=True)

# Save
feature_matrix.to_csv('../data/feature_matrix.csv', index=False)

<a id='references'></a>
## References

- [Deep Feature Synthesis - original paper](https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf)