# STANDARD SOLUTION: FEATURE EXTRACTION WITH FEATURETOOLS (PYTHON)


### OBJECTIVE

The goal of this notebook is to provide a standard solution for creating features starting from a single dataset.

### INTRODUCTION

Featuretools is the library we will use. It relies on Deep Feature Synthesis (DFS) to generate features automatically. DFS builds features from two types of primitives:
- Transformations: applied to one or more columns of a single table.
- Aggregations: applied across tables whose entities have a parent-child relationship, such as the maximum sale per customer.
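
To make the distinction concrete, here is a minimal pandas-only sketch (it does not use Featuretools itself). The `sales` table and its column names are hypothetical toy data, not part of the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per sale, several sales per customer.
sales = pd.DataFrame({
    "sale_id": [0, 1, 2, 3],
    "customer_id": ["a", "a", "b", "b"],
    "amount": [10.0, 25.0, 5.0, 7.5],
    "date": pd.to_datetime(["2014-10-13", "2014-12-09", "2015-02-25", "2015-02-18"]),
})

# Transformation: the new value depends only on columns of the same row.
sales["month"] = sales["date"].dt.month

# Aggregation: many child rows (sales) collapse into one value per parent (customer).
max_sale_per_customer = sales.groupby("customer_id")["amount"].max()
print(max_sale_per_customer)
```

A transformation keeps the number of rows unchanged, while an aggregation returns one row per parent key.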

In [287]:
import pandas as pd
import numpy as np
import datetime
import featuretools as ft
import featuretools.variable_types as vtypes

### Example 1:

Suppose we have a car dataset like the one shown below. Let's see how many features the *featuretools* library can create from it.

In [202]:
# Read the dataset
df = pd.read_csv('Automobile_data.csv')

In [203]:
# It has 205 rows and 26 columns
df.shape

(205, 26)

In [148]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [204]:
# Insert a new column that will serve as the unique index
df.insert(0, 'ID', range(len(df)))

Before using Deep Feature Synthesis, it is advisable to prepare the data as an **EntitySet**.  
First, we create an entityset that we will call *coches*.

### STEPS

In [208]:
#Create new entityset
es = ft.EntitySet(id="coches")

Now we need to add entities. Each one must have an index (a column whose values are all unique).

In [209]:
es = es.entity_from_dataframe(entity_id="coches", index='ID', dataframe=df)
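
Since the index column must contain unique values, it can be worth checking this before registering a dataframe as an entity. The following is a small sketch on hypothetical toy data (`df_demo` is not the real dataset); it uses only pandas.

```python
import pandas as pd

# Hypothetical toy dataframe standing in for the car dataset.
df_demo = pd.DataFrame({
    "make": ["audi", "audi", "bmw"],
    "price": [13950, 17450, 16430],
})

# Same approach as above: add a running integer ID as the first column.
df_demo.insert(0, "ID", range(len(df_demo)))

# A valid index has no repeated values; 'make' would not qualify here.
id_is_unique = df_demo["ID"].is_unique
make_is_unique = df_demo["make"].is_unique
print(id_is_unique, make_is_unique)
```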


Since we only have a single dataset and cannot define relationships between several tables, we create new entities from columns (variables) of that dataset that we want to group by. Here we choose *make* (the car brand) and *body-style*.

In [210]:
es = es.normalize_entity(base_entity_id="coches", new_entity_id="make", index="make")
es = es.normalize_entity(base_entity_id="coches", new_entity_id="body-style", index="body-style")
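
Conceptually, normalizing on a column pulls its distinct values out into a new parent table, which the original rows then reference. A rough pandas analogue on hypothetical toy data is sketched below; it illustrates the idea and is not how Featuretools implements it internally.

```python
import pandas as pd

# Hypothetical toy version of the 'coches' entity.
coches_demo = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "make": ["alfa-romero", "audi", "audi", "bmw"],
    "body-style": ["convertible", "sedan", "sedan", "sedan"],
})

# One parent table per grouping column, holding each distinct value once.
make_table = coches_demo[["make"]].drop_duplicates().reset_index(drop=True)
body_style_table = coches_demo[["body-style"]].drop_duplicates().reset_index(drop=True)

print(make_table)
print(body_style_table)
```

Each parent table has one row per distinct value, which is why the real *make* and *body-style* entities have far fewer rows than *coches*.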

Note how the EntitySet now contains the new entities as well as the relationships between them, printed as child -> parent. Here the 'make' variable acts as the primary key (index) of the new *make* entity, and `coches.make` references it.

In [213]:
es

Entityset: coches
  Entities:
    coches [Rows: 205, Columns: 27]
    make [Rows: 22, Columns: 1]
    body-style [Rows: 5, Columns: 1]
  Relationships:
    coches.make -> make.make
    coches.body-style -> body-style.body-style

In [214]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="make")

The result is a dataframe of new features for each car make.

In [217]:
feature_matrix.head()

make,SUM(coches.symboling),SUM(coches.wheel-base),SUM(coches.length),SUM(coches.width),SUM(coches.height),SUM(coches.curb-weight),SUM(coches.engine-size),SUM(coches.compression-ratio),SUM(coches.city-mpg),SUM(coches.highway-mpg),...,MODE(coches.drive-wheels),MODE(coches.engine-location),MODE(coches.engine-type),MODE(coches.num-of-cylinders),MODE(coches.fuel-system),MODE(coches.bore),MODE(coches.stroke),MODE(coches.horsepower),MODE(coches.peak-rpm),MODE(coches.price)
alfa-romero,7,271.7,508.8,193.7,150.0,7919,412,27.0,61,80,...,rwd,front,dohc,four,mpfi,3.47,2.68,111,5000,16500
audi,9,715.9,1286.8,481.0,381.0,19605,915,58.8,132,169,...,fwd,front,ohc,five,mpfi,3.19,3.4,110,5500,13950
bmw,3,825.3,1476.0,531.8,438.6,23435,1335,68.6,155,203,...,rwd,front,ohc,six,mpfi,3.31,3.19,121,4250,16430
chevrolet,3,277.4,455.8,187.5,157.2,5271,241,28.7,123,139,...,fwd,front,ohc,four,2bbl,3.03,3.11,70,5400,5151
dodge,9,855.1,1448.9,577.5,464.8,19362,924,77.71,252,307,...,fwd,front,ohc,four,2bbl,2.97,3.23,68,5500,12964


Here are all the features that were created:

In [220]:
feature_defs

[<Feature: SUM(coches.symboling)>,
 <Feature: SUM(coches.wheel-base)>,
 <Feature: SUM(coches.length)>,
 <Feature: SUM(coches.width)>,
 <Feature: SUM(coches.height)>,
 <Feature: SUM(coches.curb-weight)>,
 <Feature: SUM(coches.engine-size)>,
 <Feature: SUM(coches.compression-ratio)>,
 <Feature: SUM(coches.city-mpg)>,
 <Feature: SUM(coches.highway-mpg)>,
 <Feature: STD(coches.symboling)>,
 <Feature: STD(coches.wheel-base)>,
 <Feature: STD(coches.length)>,
 <Feature: STD(coches.width)>,
 <Feature: STD(coches.height)>,
 <Feature: STD(coches.curb-weight)>,
 <Feature: STD(coches.engine-size)>,
 <Feature: STD(coches.compression-ratio)>,
 <Feature: STD(coches.city-mpg)>,
 <Feature: STD(coches.highway-mpg)>,
 <Feature: MAX(coches.symboling)>,
 <Feature: MAX(coches.wheel-base)>,
 <Feature: MAX(coches.length)>,
 <Feature: MAX(coches.width)>,
 <Feature: MAX(coches.height)>,
 <Feature: MAX(coches.curb-weight)>,
 <Feature: MAX(coches.engine-size)>,
 <Feature: MAX(coches.compression-ratio)>,
 <Feature

If we only want specific features, we can restrict the aggregation primitives that are used:

In [221]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="make", 
                                 agg_primitives = ['mean'])

In [222]:
feature_matrix

make,MEAN(coches.symboling),MEAN(coches.wheel-base),MEAN(coches.length),MEAN(coches.width),MEAN(coches.height),MEAN(coches.curb-weight),MEAN(coches.engine-size),MEAN(coches.compression-ratio),MEAN(coches.city-mpg),MEAN(coches.highway-mpg)
alfa-romero,2.333333,90.566667,169.6,64.566667,50.0,2639.666667,137.333333,9.0,20.333333,26.666667
audi,1.285714,102.271429,183.828571,68.714286,54.428571,2800.714286,130.714286,8.4,18.857143,24.142857
bmw,0.375,103.1625,184.5,66.475,54.825,2929.375,166.875,8.575,19.375,25.375
chevrolet,1.0,92.466667,151.933333,62.5,52.4,1757.0,80.333333,9.566667,41.0,46.333333
dodge,1.0,95.011111,160.988889,64.166667,51.644444,2151.333333,102.666667,8.634444,28.0,34.111111
honda,0.615385,94.330769,160.769231,64.384615,53.238462,2096.769231,99.307692,9.215385,30.384615,35.461538
isuzu,0.75,94.825,163.775,63.55,52.225,2213.5,102.5,9.225,31.0,36.0
jaguar,0.0,109.333333,196.966667,69.933333,51.133333,4027.333333,280.666667,9.233333,14.333333,18.333333
mazda,1.117647,97.017647,170.805882,65.588235,53.358824,2297.823529,103.0,10.488235,25.705882,31.941176
mercedes-benz,0.0,110.925,195.2625,71.0625,55.725,3696.25,226.5,14.825,18.5,21.0
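
For intuition, the MEAN features above correspond roughly to a group-by mean over the child rows. The sketch below reproduces the idea in plain pandas on hypothetical toy values (`coches_demo` is made up and much smaller than the real data); the column renaming only mimics the Featuretools naming style.

```python
import pandas as pd

# Hypothetical toy rows: two cars per make.
coches_demo = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw"],
    "city-mpg": [24, 18, 23, 21],
    "highway-mpg": [30, 22, 29, 28],
})

# Mean of each numeric column per make, one row per make.
mean_by_make = coches_demo.groupby("make")[["city-mpg", "highway-mpg"]].mean()

# Rename columns to resemble the feature names shown above.
mean_by_make.columns = [f"MEAN(coches.{c})" for c in mean_by_make.columns]
print(mean_by_make)
```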


### Example 2:

Let's try another dataset, this time one that includes dates.

In [223]:
casas = pd.read_csv('kc_house_data.csv')

In [224]:
casas.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [225]:
casas['date'] = pd.to_datetime(casas['date'], format= '%Y%m%dT%H%M%S')
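
The format string matches the compact timestamps seen in the `date` column (`YYYYMMDD`, a literal `T`, then `HHMMSS`). A quick standalone check on two of the values shown above:

```python
import pandas as pd

# Two raw values copied from the 'date' column preview.
raw = pd.Series(["20141013T000000", "20141209T000000"])

# %Y%m%d parses the date part, 'T' is matched literally, %H%M%S parses the time.
parsed = pd.to_datetime(raw, format="%Y%m%dT%H%M%S")
print(parsed)
```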

In [226]:
# Insert a new column that will serve as the index
casas.insert(0, 'ID', range(len(casas)))


We create the EntitySet and add the dataset as an entity.

In [253]:
es = ft.EntitySet(id="casas")
es = es.entity_from_dataframe(entity_id="casas", index='ID', dataframe=casas)


In [263]:
es

Entityset: casas
  Entities:
    casas [Rows: 21613, Columns: 22]
  Relationships:
    No relationships

Using the *variables* attribute, we can see the type that has been inferred for each column.

In [268]:
es['casas'].variables 

[<Variable: ID (dtype = index)>,
 <Variable: id (dtype = numeric)>,
 <Variable: date (dtype: datetime, format: None)>,
 <Variable: price (dtype = numeric)>,
 <Variable: bedrooms (dtype = numeric)>,
 <Variable: bathrooms (dtype = numeric)>,
 <Variable: sqft_living (dtype = numeric)>,
 <Variable: sqft_lot (dtype = numeric)>,
 <Variable: floors (dtype = numeric)>,
 <Variable: waterfront (dtype = numeric)>,
 <Variable: view (dtype = numeric)>,
 <Variable: condition (dtype = numeric)>,
 <Variable: grade (dtype = numeric)>,
 <Variable: sqft_above (dtype = numeric)>,
 <Variable: sqft_basement (dtype = numeric)>,
 <Variable: yr_built (dtype = numeric)>,
 <Variable: yr_renovated (dtype = numeric)>,
 <Variable: zipcode (dtype = numeric)>,
 <Variable: lat (dtype = numeric)>,
 <Variable: long (dtype = numeric)>,
 <Variable: sqft_living15 (dtype = numeric)>,
 <Variable: sqft_lot15 (dtype = numeric)>]

Sometimes *featuretools* treats a variable as numeric when it is really categorical. This happens with **waterfront** and **view**, for example. In such cases the type must be stated explicitly when creating the entity:

In [274]:
variable_types = { 'waterfront': vtypes.Categorical,
      'view': vtypes.Categorical}

es = es.entity_from_dataframe(entity_id="casas", index='ID', dataframe=casas, variable_types=variable_types)
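
One way to spot such columns ahead of time is to look for numeric columns with very few distinct values. The helper below is a hypothetical heuristic written for this notebook (it is not a Featuretools function), and the `max_unique` cutoff is an arbitrary assumption; the result is only a list of candidates to review by hand.

```python
import pandas as pd


def low_cardinality_numeric_columns(df, max_unique=10):
    """Return numeric columns with at most `max_unique` distinct values.

    Hypothetical helper: such columns are candidates for a categorical type.
    """
    numeric = df.select_dtypes(include="number")
    return [c for c in numeric.columns if numeric[c].nunique() <= max_unique]


# Hypothetical toy rows loosely modelled on the housing data.
casas_demo = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "waterfront": [0, 0, 1, 0],
    "view": [0, 2, 0, 4],
})

candidates = low_cardinality_numeric_columns(casas_demo, max_unique=3)
print(candidates)
```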

In [276]:
es['casas'].variables 

[<Variable: ID (dtype = index)>,
 <Variable: id (dtype = numeric)>,
 <Variable: date (dtype: datetime, format: None)>,
 <Variable: price (dtype = numeric)>,
 <Variable: bedrooms (dtype = numeric)>,
 <Variable: bathrooms (dtype = numeric)>,
 <Variable: sqft_living (dtype = numeric)>,
 <Variable: sqft_lot (dtype = numeric)>,
 <Variable: floors (dtype = numeric)>,
 <Variable: condition (dtype = numeric)>,
 <Variable: grade (dtype = numeric)>,
 <Variable: sqft_above (dtype = numeric)>,
 <Variable: sqft_basement (dtype = numeric)>,
 <Variable: yr_built (dtype = numeric)>,
 <Variable: yr_renovated (dtype = numeric)>,
 <Variable: zipcode (dtype = numeric)>,
 <Variable: lat (dtype = numeric)>,
 <Variable: long (dtype = numeric)>,
 <Variable: sqft_living15 (dtype = numeric)>,
 <Variable: sqft_lot15 (dtype = numeric)>,
 <Variable: waterfront (dtype = categorical)>,
 <Variable: view (dtype = categorical)>]

In [277]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="casas")

In [278]:
len(feature_defs)  # 24 features have been created

24

In [279]:
feature_defs  # Note that the last features created split the date into day, year, month and weekday.

[<Feature: id>,
 <Feature: price>,
 <Feature: bedrooms>,
 <Feature: bathrooms>,
 <Feature: sqft_living>,
 <Feature: sqft_lot>,
 <Feature: floors>,
 <Feature: condition>,
 <Feature: grade>,
 <Feature: sqft_above>,
 <Feature: sqft_basement>,
 <Feature: yr_built>,
 <Feature: yr_renovated>,
 <Feature: zipcode>,
 <Feature: lat>,
 <Feature: long>,
 <Feature: sqft_living15>,
 <Feature: sqft_lot15>,
 <Feature: waterfront>,
 <Feature: view>,
 <Feature: DAY(date)>,
 <Feature: YEAR(date)>,
 <Feature: MONTH(date)>,
 <Feature: WEEKDAY(date)>]

In [280]:
feature_matrix.head()

ID,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,condition,grade,sqft_above,...,lat,long,sqft_living15,sqft_lot15,waterfront,view,DAY(date),YEAR(date),MONTH(date),WEEKDAY(date)
0,7129300520,221900.0,3,1.0,1180,5650,1.0,3,7,1180,...,47.5112,-122.257,1340,5650,0,0,13,2014,10,0
1,6414100192,538000.0,3,2.25,2570,7242,2.0,3,7,2170,...,47.721,-122.319,1690,7639,0,0,9,2014,12,1
2,5631500400,180000.0,2,1.0,770,10000,1.0,3,6,770,...,47.7379,-122.233,2720,8062,0,0,25,2015,2,2
3,2487200875,604000.0,4,3.0,1960,5000,1.0,5,7,1050,...,47.5208,-122.393,1360,5000,0,0,9,2014,12,1
4,1954400510,510000.0,3,2.0,1680,8080,1.0,3,8,1680,...,47.6168,-122.045,1800,7503,0,0,18,2015,2,2
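
The four date features can be reproduced with the pandas `.dt` accessor, which makes it easy to sanity-check them. The sketch below uses three of the dates from the preview; the column names simply mirror the Featuretools feature names.

```python
import pandas as pd

# Three sale dates taken from the first rows of the dataset.
dates = pd.Series(pd.to_datetime(["2014-10-13", "2014-12-09", "2015-02-25"]))

# Same decomposition as the DAY / YEAR / MONTH / WEEKDAY transform primitives.
date_features = pd.DataFrame({
    "DAY(date)": dates.dt.day,
    "YEAR(date)": dates.dt.year,
    "MONTH(date)": dates.dt.month,
    "WEEKDAY(date)": dates.dt.weekday,  # Monday = 0, Sunday = 6
})
print(date_features)
```

The weekday values agree with the feature matrix above: 13 October 2014 was a Monday (0) and 9 December 2014 a Tuesday (1).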


By default, Featuretools applies different aggregation primitives depending on the type of each column (as seen in Example 1):
- numeric: SUM, STD, MAX, SKEW, MIN and MEAN
- categorical: NUM_UNIQUE and MODE

### What about the test set?

We need to apply the same transformations to the test set. However, how to do so is not obvious.  
The recommended approach is to build an EntitySet from the test data and recompute the same features by calling **ft.calculate_feature_matrix** with the list of features defined previously. To do that, we first need to encode those features on our training set and save the result.

In [282]:
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs, include_unknown=False)

This simply one-hot encodes our categorical features (much like a LabelBinarizer). We save the result.

In [286]:
X_train = feature_matrix_enc.copy()
X_train.shape

(21613, 54)
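
The reason for keeping the encoded feature list is that train and test must end up with exactly the same columns, even if a category is missing from or new in the test data. The pandas-only sketch below illustrates that idea with `get_dummies` and `reindex` on hypothetical toy data; it is an analogy, not what `ft.encode_features` does internally.

```python
import pandas as pd

# Hypothetical toy data; category "3" appears only in the test set.
train = pd.DataFrame({"price": [1.0, 2.0, 3.0], "view": ["0", "2", "4"]})
test = pd.DataFrame({"price": [4.0, 5.0], "view": ["0", "3"]})

# One-hot encode the training data; its columns define the feature set.
train_enc = pd.get_dummies(train, columns=["view"])

# Encode the test data, then force it onto the training columns:
# unseen categories are dropped and missing ones are filled with 0.
test_enc = pd.get_dummies(test, columns=["view"]).reindex(
    columns=train_enc.columns, fill_value=0
)
print(test_enc)
```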

Next, we create an EntitySet for our test set.

In [None]:
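# Create an EntitySet for the test data
es_tst = ft.EntitySet(id='casas')
# Add the TEST SET dataframe (X_test must be a raw dataframe with the same columns as casas, including 'ID').
# The same variable_types are passed so that waterfront and view are again treated as categorical.
es_tst = es_tst.entity_from_dataframe(entity_id='casas', dataframe=X_test, index='ID', variable_types=variable_types)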
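
The cell above assumes that a held-out dataframe `X_test` already exists, which this notebook never creates. One simple way to obtain such a split with pandas alone is sketched here on hypothetical toy data; the 80/20 ratio and the names `train_demo` and `test_demo` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy stand-in for the 'casas' dataframe.
casas_demo = pd.DataFrame({
    "ID": range(10),
    "price": [float(p) for p in range(100, 110)],
})

# Random 80% of the rows for training; the fixed seed makes the split repeatable.
train_demo = casas_demo.sample(frac=0.8, random_state=0)

# Everything not sampled becomes the test set.
test_demo = casas_demo.drop(train_demo.index)
print(len(train_demo), len(test_demo))
```

With a split like this, the training EntitySet would be built from the training part only, so that no test rows influence the features.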

We can then compute the feature matrix on the test EntitySet by passing in the list of features saved from training.

In [None]:
feature_matrix_tst = ft.calculate_feature_matrix(features=features_enc, entityset=es_tst)

### Use Feature Selection to prune the features
Once we have generated a large number of new features, we will probably need to reduce them. Many are likely to be highly correlated, so we identify those and remove them.

In [None]:
# Threshold for removing correlated variables
threshold = 0.7

# Absolute value correlation matrix
corr_matrix = X_train.corr().abs()
# Keep only the upper triangle so each pair is checked once (np.bool was removed in NumPy 1.24, so use bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Select columns with correlations above threshold
collinear_features = [column for column in upper.columns if any(upper[column] > threshold)]
X_train_flt = X_train.drop(columns=collinear_features)
# Drop the same columns from the test feature matrix computed earlier, not from the raw test dataframe
X_test_flt = feature_matrix_tst.drop(columns=collinear_features)
X_train_flt.shape, X_test_flt.shape
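
To see how this filter behaves, here is the same procedure on a small hypothetical dataframe where column `b` is an exact multiple of `a` and column `c` is only weakly correlated with both. All names and values are toy assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: b = 2 * a, c is weakly related to a.
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = X.corr().abs()

# Boolean mask of the strict upper triangle: each pair appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates above the threshold with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
X_flt = X.drop(columns=to_drop)
print(to_drop, list(X_flt.columns))
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is removed, so the outcome depends on column order.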