In [None]:
!pip install featuretools

Collecting featuretools
  Downloading featuretools-1.31.0-py3-none-any.whl.metadata (15 kB)
Collecting woodwork>=0.28.0 (from featuretools)
  Downloading woodwork-0.31.0-py3-none-any.whl.metadata (10 kB)
Downloading featuretools-1.31.0-py3-none-any.whl (587 kB)
Downloading woodwork-0.31.0-py3-none-any.whl (215 kB)
Installing collected packages: woodwork, featuretools
Successfully installed featuretools-1.31.0 woodwork-0.31.0


In [None]:
import pandas as pd
import numpy as np
import featuretools as ft
import warnings

warnings.filterwarnings("ignore")

from woodwork.logical_types import (
    Datetime,
    Categorical,
    PostalCode
)
from featuretools.primitives import AggregationPrimitive
from woodwork.column_schema import ColumnSchema

# Featuretools

## 1. What is Featuretools?

Featuretools is a Python library designed to automate the creation of features (variables or attributes) from relational data. Building features by hand is tedious and error-prone, especially with complex datasets; Featuretools instead uses algorithms to discover and build new features automatically from the existing columns across the tables of a database. This considerably speeds up feature engineering, lets data scientists explore a much larger feature space, and can improve the accuracy of machine learning models. In short, it simplifies and automates the feature engineering phase of data science projects.

![a](pics/features_1.png)

## 2. EntitySet


In Featuretools, an `EntitySet` represents a relational data schema. Each table (dataframe) in your dataset is mapped to a dataframe inside the `EntitySet`.

To be added to an `EntitySet`, each table needs:
- Name: the name of the table inside the `EntitySet`.
- Index: the primary key.
- Time index: if the table has a column that records when each transaction happened, that column should be set as the `time_index`.

If your data is related (as in a relational database), the links between tables (dataframes) are declared as relationships in the `EntitySet`. This lets the library understand how the tables connect and generate features that combine information from several of them.

In [None]:
# Load the sample data and flatten it into a single transactions table
data = ft.demo.load_mock_customer()
transactions_df = (
    data["transactions"]
    .merge(data["sessions"])
    .merge(data["customers"])
    # Truncate every timestamp to midnight so only the date part remains
    .assign(
        transaction_time = lambda x: pd.to_datetime(x['transaction_time']).dt.normalize(),
        session_start = lambda x: pd.to_datetime(x['session_start']).dt.normalize(),
        join_date = lambda x: pd.to_datetime(x['join_date']).dt.normalize(),
        birthday = lambda x: pd.to_datetime(x['birthday']).dt.normalize(),
    )
)
transactions_df.head()

Unnamed: 0,transaction_id,session_id,transaction_time,product_id,amount,customer_id,device,session_start,zip_code,join_date,birthday
0,2,1,2014-01-01,5,127.64,2,desktop,2014-01-01,13244,2012-04-15,1986-08-18
1,495,1,2014-01-01,2,109.48,2,desktop,2014-01-01,13244,2012-04-15,1986-08-18
2,341,1,2014-01-01,3,95.06,2,desktop,2014-01-01,13244,2012-04-15,1986-08-18
3,308,1,2014-01-01,4,78.92,2,desktop,2014-01-01,13244,2012-04-15,1986-08-18
4,271,1,2014-01-01,3,31.54,2,desktop,2014-01-01,13244,2012-04-15,1986-08-18


In [None]:
products_df = data["products"]
products_df

Unnamed: 0,product_id,brand
0,1,B
1,2,B
2,3,B
3,4,B
4,5,A


### Create the EntitySet

In [None]:
es = ft.EntitySet(id="customer_data")

### Add dataframes

In [None]:
es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_df,
    index="transaction_id",
    time_index="transaction_time",
    logical_types={
        "product_id": Categorical,
        "zip_code": PostalCode,
        "transaction_time":Datetime,
        "session_start":Datetime,
        "join_date":Datetime,
        "birthday":Datetime,
    },
)
es["transactions"].ww.schema

Unnamed: 0_level_0,Logical Type,Semantic Tag(s)
Column,Unnamed: 1_level_1,Unnamed: 2_level_1
transaction_id,Integer,['index']
session_id,Integer,['numeric']
transaction_time,Datetime,['time_index']
product_id,Categorical,['category']
amount,Double,['numeric']
customer_id,Integer,['numeric']
device,Categorical,['category']
session_start,Datetime,[]
zip_code,PostalCode,['category']
join_date,Datetime,[]


In [None]:
es = es.add_dataframe(
    dataframe_name="products",
    dataframe=products_df,
    index="product_id",
    logical_types={
        "product_id": Categorical,
        "brand": Categorical
    }
)

### Add relationship

In [None]:
es = es.add_relationship("products", "product_id", "transactions", "product_id")
es

Entityset: customer_data
  DataFrames:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    transactions.product_id -> products.product_id

## 3. Primitives

Primitives in Featuretools are functions applied to the columns of the dataframes in an `EntitySet` to generate new features. They are the basic building blocks Featuretools uses to create new variables automatically. They can be simple statistics (mean, sum, count, maximum, minimum), transformations (logarithm, normalization), or more complex functions that draw on data from several related tables.

- Aggregation: these primitives summarize the rows of a child dataframe (the related table) for each row of its parent dataframe (the main table). They compute a statistic (sum, mean, count, etc.) over the set of child rows related to a single parent row.

- Transform: these primitives transform the values of a column within the same dataframe. They do not need relationships between dataframes to operate.

- Custom: primitives you define yourself.

| Characteristic | Aggregation primitives | Transform primitives |
|---|---|---|
| **Operation** | Summarizes data from related dataframes (grouping) | Transforms values within a single dataframe |
| **Relationships between dataframes** | Requires relationships | Does not require relationships |
| **Result** | Condenses many rows into a single aggregated value | Generates new columns in the same dataframe |
| **Examples** | `sum(amount)`, `mean(price)`, `count(transactions)` | `year(date)`, `month(date)`, `log(value)` |

The primitives that ship with the library can be listed with `ft.list_primitives()`, for example `ft.list_primitives().head(5)`.

In [None]:
ft.list_primitives().type.value_counts()

type
transform      138
aggregation     65
Name: count, dtype: int64
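
To make the difference between the two primitive types concrete, here is a minimal pure-Python sketch that does not use Featuretools at all. The sample rows and the helper names `mean_amount_per_product` and `month_of` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical child rows (transactions); each one points to a parent product.
transactions = [
    {"product_id": 1, "amount": 10.0, "time": datetime(2014, 1, 1)},
    {"product_id": 1, "amount": 30.0, "time": datetime(2014, 2, 1)},
    {"product_id": 2, "amount": 5.0, "time": datetime(2014, 3, 1)},
]


def mean_amount_per_product(rows):
    """Aggregation: many child rows collapse into one value per parent."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["product_id"]].append(row["amount"])
    return {pid: sum(vals) / len(vals) for pid, vals in grouped.items()}


def month_of(rows):
    """Transform: one output value per input row, no relationship needed."""
    return [row["time"].month for row in rows]


agg_result = mean_amount_per_product(transactions)
transform_result = month_of(transactions)
print(agg_result)
print(transform_result)
```

The aggregation returns one value per product, while the transform returns one value per transaction, which is the distinction summarized in the table above.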

### Aggregation primitives

In [None]:
ft.list_primitives().query('type =="aggregation"').sample(10)

Unnamed: 0,name,type,description,valid_inputs,return_type
35,trend,aggregation,Calculates the trend of a column over time.,"<ColumnSchema (Semantic Tags = ['numeric'])>, ...",<ColumnSchema (Semantic Tags = ['numeric'])>
45,max_min_delta,aggregation,Determines the difference between the max and ...,<ColumnSchema (Semantic Tags = ['numeric'])>,<ColumnSchema (Semantic Tags = ['numeric'])>
46,n_most_common,aggregation,Determines the `n` most common elements.,<ColumnSchema (Semantic Tags = ['category'])>,
18,mode,aggregation,Determines the most commonly repeated value.,<ColumnSchema (Semantic Tags = ['category'])>,
51,is_monotonically_increasing,aggregation,Determines if a series is monotonically increa...,<ColumnSchema (Semantic Tags = ['numeric'])>,<ColumnSchema (Logical Type = BooleanNullable)>
15,num_true,aggregation,Counts the number of `True` values.,"<ColumnSchema (Logical Type = Boolean)>, <Colu...",<ColumnSchema (Logical Type = IntegerNullable)...
38,max_consecutive_zeros,aggregation,Determines the maximum number of consecutive z...,"<ColumnSchema (Logical Type = Integer)>, <Colu...",<ColumnSchema (Logical Type = Integer) (Semant...
25,kurtosis,aggregation,Calculates the kurtosis for a list of numbers,<ColumnSchema (Logical Type = Integer) (Semant...,<ColumnSchema (Logical Type = Double) (Semanti...
32,n_unique_days_of_month,aggregation,Determines the number of unique days of month.,<ColumnSchema (Logical Type = Datetime)>,<ColumnSchema (Logical Type = Integer) (Semant...
9,count_above_mean,aggregation,Calculates the number of values that are above...,<ColumnSchema (Semantic Tags = ['numeric'])>,<ColumnSchema (Logical Type = IntegerNullable)...


### Transform primitives

In [None]:
ft.list_primitives().query('type =="transform"').sample(10)

Unnamed: 0,name,type,description,valid_inputs,return_type
154,url_to_protocol,transform,Determines the protocol (http or https) of a url.,<ColumnSchema (Logical Type = URL)>,<ColumnSchema (Logical Type = Categorical) (Se...
89,is_year_start,transform,Determines if a date falls on the start of a y...,<ColumnSchema (Logical Type = Datetime)>,<ColumnSchema (Logical Type = BooleanNullable)>
162,cityblock_distance,transform,Calculates the distance between points in a ci...,<ColumnSchema (Logical Type = LatLong)>,<ColumnSchema (Logical Type = Double) (Semanti...
199,full_name_to_title,transform,Determines the title from a person's name.,<ColumnSchema (Logical Type = PersonFullName)>,<ColumnSchema (Logical Type = Categorical) (Se...
130,median_word_length,transform,Determines the median word length.,<ColumnSchema (Logical Type = NaturalLanguage)>,<ColumnSchema (Logical Type = Double) (Semanti...
168,is_free_email_domain,transform,Determines if an email address is from a free ...,<ColumnSchema (Logical Type = EmailAddress)>,<ColumnSchema (Logical Type = BooleanNullable)>
141,rate_of_change,transform,Computes the rate of change of a value per sec...,"<ColumnSchema (Semantic Tags = ['numeric'])>, ...",<ColumnSchema (Logical Type = Double) (Semanti...
140,expanding_count,transform,Computes the expanding count of events over a ...,<ColumnSchema (Logical Type = Datetime) (Seman...,<ColumnSchema (Logical Type = IntegerNullable)...
69,number_of_unique_words,transform,Determines the number of unique words in a str...,<ColumnSchema (Logical Type = NaturalLanguage)>,<ColumnSchema (Logical Type = IntegerNullable)...
72,same_as_previous,transform,Determines if a value is equal to the previous...,<ColumnSchema (Semantic Tags = ['numeric'])>,<ColumnSchema (Logical Type = BooleanNullable)>


### Custom primitives

In [None]:
def calculate_rolling_mean_change(series):
    # Drop missing values and order the observations by their time index.
    clean_series = series.dropna().sort_index()
    # At least two observations are needed to compare consecutive rolling means.
    if clean_series.shape[0] <= 1:
        return np.nan
    rolling_mean = clean_series.rolling(window=3, min_periods=1).mean()
    previous_mean = rolling_mean.iloc[-2]
    # Avoid dividing by zero or by a missing value.
    if previous_mean == 0 or pd.isna(previous_mean):
        return np.nan
    # Ratio between the latest rolling mean and the one before it.
    return rolling_mean.iloc[-1] / previous_mean

class RollingMeanChange(AggregationPrimitive):

    name = "rolling_mean_change"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the rolling mean change of {} over time"
    default_value = np.nan

    def get_function(self):
        def pd_rolling_mean_change(y, x):
            return calculate_rolling_mean_change(pd.Series(data=y.values, index=x.values))

        return pd_rolling_mean_change
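
As a sanity check on the logic behind this primitive, the following pure-Python sketch reproduces the same calculation without pandas: a rolling mean over up to three values, then the ratio between the last two rolling means. The function name `rolling_mean_change` and the sample values are hypothetical, and the input is assumed to be already sorted by time.

```python
import math


def rolling_mean_change(values, window=3):
    """Ratio of the last rolling mean to the previous one (NaN if undefined)."""
    # Drop missing values; the input is assumed to be ordered by time.
    clean = [v for v in values if v is not None and not math.isnan(v)]
    if len(clean) <= 1:
        return math.nan

    def rolling_mean(end):
        # Mean of up to `window` values ending at position `end` (inclusive),
        # mirroring rolling(window=3, min_periods=1).
        chunk = clean[max(0, end - window + 1): end + 1]
        return sum(chunk) / len(chunk)

    previous = rolling_mean(len(clean) - 2)
    latest = rolling_mean(len(clean) - 1)
    return math.nan if previous == 0 else latest / previous


# Rolling means are 10, 15, 20, 30, so the change is 30 / 20.
growth = rolling_mean_change([10, 20, 30, 40])
print(growth)
```

A value above 1 means the most recent rolling mean is higher than the previous one; a single observation yields NaN because there is nothing to compare against.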


### Create features

In [None]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="products",
    max_depth=1,
)
feature_matrix

Unnamed: 0_level_0,brand,COUNT(transactions),MAX(transactions.amount),MAX(transactions.customer_id),MAX(transactions.session_id),MEAN(transactions.amount),MEAN(transactions.customer_id),MEAN(transactions.session_id),MIN(transactions.amount),MIN(transactions.customer_id),...,NUM_UNIQUE(transactions.zip_code),SKEW(transactions.amount),SKEW(transactions.customer_id),SKEW(transactions.session_id),STD(transactions.amount),STD(transactions.customer_id),STD(transactions.session_id),SUM(transactions.amount),SUM(transactions.customer_id),SUM(transactions.session_id)
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,B,102,149.56,5.0,35.0,73.429314,2.921569,17.735294,6.84,1.0,...,2,0.125525,-0.054764,0.036985,42.479989,1.404986,10.640783,7489.79,298.0,1809.0
2,B,92,149.95,5.0,34.0,76.319891,2.945652,17.076087,5.73,1.0,...,2,0.151934,-0.018901,0.042049,46.336308,1.424774,9.391642,7021.43,271.0,1571.0
3,B,96,148.31,5.0,35.0,73.00125,3.03125,19.083333,5.89,1.0,...,2,0.223938,-0.058031,-0.126098,38.871405,1.341273,10.703336,7008.12,291.0,1832.0
4,B,106,146.46,5.0,35.0,76.311038,2.575472,17.867925,5.81,1.0,...,2,-0.132077,0.372178,0.013343,42.492501,1.440556,9.942765,8088.97,273.0,1894.0
5,A,104,149.02,5.0,35.0,76.264904,2.778846,18.403846,5.91,1.0,...,2,0.098248,0.168882,-0.099025,42.131902,1.474421,11.129005,7931.55,289.0,1914.0


23 different columns were generated across aggregations and transformations.

## 4. Feature depth

Featuretools provides a powerful tool called Deep Feature Synthesis (DFS) to automate the creation of features from relational data. Instead of building features by hand, DFS searches for new features by applying primitives to columns and combining columns from different related tables.

The **max_depth** parameter of the `dfs` function controls how deep the search performed by DFS goes. It sets the maximum depth of a feature, that is, how many primitives can be stacked to build it.

As max_depth increases:
- More features are created.
- Features become harder to interpret.
- The computational cost increases.
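
The growth in the number of features can be illustrated with a toy enumeration of feature names. This is only a sketch of the stacking idea, not how Featuretools actually searches: the column lists, the primitive lists, and the `candidate_features` helper are all hypothetical.

```python
from itertools import product

# Hypothetical child columns and primitive names (not Featuretools' real search).
numeric_columns = ["amount"]
datetime_columns = ["transaction_time"]
numeric_aggs = ["MEAN", "SUM"]
categorical_aggs = ["MODE"]
transforms = ["MONTH", "HOUR"]


def candidate_features(max_depth):
    """List feature names built by stacking at most `max_depth` primitives."""
    features = []
    if max_depth >= 1:
        # Depth 1: one aggregation applied directly to a child column.
        features += [
            f"{agg}(transactions.{col})"
            for agg, col in product(numeric_aggs, numeric_columns)
        ]
    if max_depth >= 2:
        # Depth 2: an aggregation stacked on top of a transform.
        features += [
            f"{agg}(transactions.{trans}({col}))"
            for agg, trans, col in product(categorical_aggs, transforms, datetime_columns)
        ]
    return features


depth_1 = candidate_features(1)
depth_2 = candidate_features(2)
print(len(depth_1), len(depth_2))
```

Even in this tiny setting the candidate list doubles when going from depth 1 to depth 2, and the deeper names are harder to read at a glance.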

In [None]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="products",
    agg_primitives=["mean", "sum", "mode"],
    trans_primitives=["month", "hour"],
    max_depth=2,
)
feature_matrix

Unnamed: 0_level_0,brand,MEAN(transactions.amount),MEAN(transactions.customer_id),MEAN(transactions.session_id),MODE(transactions.device),MODE(transactions.zip_code),SUM(transactions.amount),SUM(transactions.customer_id),SUM(transactions.session_id),MODE(transactions.HOUR(birthday)),MODE(transactions.HOUR(join_date)),MODE(transactions.HOUR(session_start)),MODE(transactions.HOUR(transaction_time)),MODE(transactions.MONTH(birthday)),MODE(transactions.MONTH(join_date)),MODE(transactions.MONTH(session_start)),MODE(transactions.MONTH(transaction_time))
product_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1,B,73.429314,2.921569,17.735294,desktop,60091,7489.79,298.0,1809.0,0,0,0,0,7,4,1,1
2,B,76.319891,2.945652,17.076087,desktop,60091,7021.43,271.0,1571.0,0,0,0,0,8,4,1,1
3,B,73.00125,3.03125,19.083333,desktop,60091,7008.12,291.0,1832.0,0,0,0,0,8,4,1,1
4,B,76.311038,2.575472,17.867925,desktop,60091,8088.97,273.0,1894.0,0,0,0,0,7,4,1,1
5,A,76.264904,2.778846,18.403846,mobile,60091,7931.55,289.0,1914.0,0,0,0,0,7,4,1,1


For example: **MODE(transactions.MONTH(transaction_time))**
- It is composed of:
    - MODE -> aggregation primitive.
    - MONTH -> transform primitive.

## 5. Handling time

Featuretools offers robust support for data with temporal components, which makes it possible to create features that capture relevant temporal information. This relies mainly on declaring time columns in the dataframes and on aggregations that take time into account.

When you add dataframes to an `EntitySet`, you should specify which column represents time for each one. This is done with the `time_index` argument of `add_dataframe`. The column should be a date or datetime type. Featuretools uses this information to generate time-aware features.

### Cutoff time

The `cutoff_time` parameter is central to handling temporal data. It determines up to what point in time data may be used when generating features with Deep Feature Synthesis (DFS). It is especially useful in predictive settings, where features should reflect only the information available up to a specific moment (the prediction time).

The value of the `cutoff_time` parameter can be:

| Data type | Different cutoff time per id |
|---|:---:|
| **Timestamp** | No |
| **List of timestamps** | No |
| **DataFrame** | Yes |

For example, although our data starts on 2014-01-02, we can generate historical features computed as of points in time other than the most recent record.

![a](pics/cutoff_time.png)
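
Conceptually, a cutoff time hides every row that happens after it before any aggregation runs. The sketch below shows that idea in plain Python rather than through the Featuretools API; the sample transactions and the `max_amount_as_of` helper are hypothetical, and rows exactly at the cutoff are assumed to be included.

```python
from datetime import datetime

# Hypothetical transactions for one customer: (timestamp, amount).
transactions = [
    (datetime(2014, 1, 1, 1, 0), 50.0),
    (datetime(2014, 1, 1, 3, 30), 120.0),
    (datetime(2014, 1, 1, 6, 0), 300.0),  # happens after the cutoff below
]


def max_amount_as_of(rows, cutoff_time):
    """Maximum amount using only rows observed at or before `cutoff_time`."""
    visible = [amount for ts, amount in rows if ts <= cutoff_time]
    return max(visible) if visible else None


cutoff = datetime(2014, 1, 1, 4, 0)
max_before_cutoff = max_amount_as_of(transactions, cutoff)
max_overall = max(amount for _, amount in transactions)
print(max_before_cutoff, max_overall)
```

The feature computed as of the cutoff ignores the later, larger transaction, which is what prevents information from the future leaking into training data.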

In [None]:
es = ft.demo.load_mock_customer(return_entityset=True, random_seed=0)
es

Entityset: transactions
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 3]
    sessions [Rows: 35, Columns: 5]
    customers [Rows: 5, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

In [None]:
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    cutoff_time=pd.Timestamp("2014-1-1 04:00"),
    instance_ids=[1, 2, 3],
    cutoff_time_in_index=True,
)
(
    fm
    .reset_index()
    .query('customer_id ==1')
)

Unnamed: 0,customer_id,time,zip_code,COUNT(sessions),MODE(sessions.device),NUM_UNIQUE(sessions.device),COUNT(transactions),MAX(transactions.amount),MEAN(transactions.amount),MIN(transactions.amount),...,STD(sessions.SKEW(transactions.amount)),STD(sessions.SUM(transactions.amount)),SUM(sessions.MAX(transactions.amount)),SUM(sessions.MEAN(transactions.amount)),SUM(sessions.MIN(transactions.amount)),SUM(sessions.NUM_UNIQUE(transactions.product_id)),SUM(sessions.SKEW(transactions.amount)),SUM(sessions.STD(transactions.amount)),MODE(transactions.sessions.device),NUM_UNIQUE(transactions.sessions.device)
0,1,2014-01-01 04:00:00,60091,4,tablet,3,67,139.23,74.002836,5.81,...,0.500353,271.917637,540.04,304.6017,27.62,20.0,-0.505043,169.572874,tablet,3


In [None]:
(
    es['sessions'].merge(
        es["transactions"],
        on = 'session_id',
        how='inner'
    )
    .query('customer_id ==1')
    .sort_values('transaction_time')
    # ['amount'].max()
)

Unnamed: 0,session_id,customer_id,device,session_start,_ft_last_time_x,transaction_id,transaction_time,product_id,amount,_ft_last_time_y
41,4,1,mobile,2014-01-01 00:44:25,2014-01-01 01:10:25,450,2014-01-01 00:44:25,4,21.35,2014-01-01 00:44:25
42,4,1,mobile,2014-01-01 00:44:25,2014-01-01 01:10:25,422,2014-01-01 00:45:30,5,108.11,2014-01-01 00:45:30
43,4,1,mobile,2014-01-01 00:44:25,2014-01-01 01:10:25,249,2014-01-01 00:46:35,5,112.53,2014-01-01 00:46:35
44,4,1,mobile,2014-01-01 00:44:25,2014-01-01 01:10:25,268,2014-01-01 00:47:40,5,6.29,2014-01-01 00:47:40
45,4,1,mobile,2014-01-01 00:44:25,2014-01-01 01:10:25,97,2014-01-01 00:48:45,3,47.95,2014-01-01 00:48:45
...,...,...,...,...,...,...,...,...,...,...
408,29,1,mobile,2014-01-01 07:10:05,2014-01-01 07:26:20,182,2014-01-01 07:22:00,2,94.89,2014-01-01 07:22:00
409,29,1,mobile,2014-01-01 07:10:05,2014-01-01 07:26:20,198,2014-01-01 07:23:05,4,104.93,2014-01-01 07:23:05
410,29,1,mobile,2014-01-01 07:10:05,2014-01-01 07:26:20,156,2014-01-01 07:24:10,4,121.59,2014-01-01 07:24:10
411,29,1,mobile,2014-01-01 07:10:05,2014-01-01 07:26:20,147,2014-01-01 07:25:15,4,116.33,2014-01-01 07:25:15


In [None]:
(
    es['sessions'].merge(
        es["transactions"],
        on = 'session_id',
        how='inner'
    )
    .query('customer_id ==1')
    .query('transaction_time <= "2014-01-01 04:00"')
    .sort_values('transaction_time')
    ['amount'].max()
)

np.float64(139.23)

### Cutoff time dataframe

Instead of providing a single cutoff_time value or a list of values, you can pass a DataFrame that specifies a different cutoff_time for each instance (row) of the target dataframe. This is especially useful when each record has its own cutoff point. The DataFrame must contain at least two columns:

- A column that identifies the instance: it must have the same name and data type as the index column of the target dataframe.
- A column with the cutoff_time values: it must contain the timestamp (a pandas datetime object) that applies to each instance.
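
The per-instance behaviour can be sketched in plain Python as one filtered aggregation per (instance, cutoff) pair. This is an illustration of the idea only, not the Featuretools implementation; the sample data and the `sum_amount_per_cutoff` helper are hypothetical.

```python
from datetime import datetime

# Hypothetical transactions: (customer_id, timestamp, amount).
transactions = [
    (1, datetime(2014, 1, 1, 1, 0), 10.0),
    (1, datetime(2014, 1, 1, 7, 0), 40.0),
    (2, datetime(2014, 1, 1, 2, 0), 25.0),
    (2, datetime(2014, 1, 1, 9, 0), 5.0),
]

# One row per (instance, cutoff); the same customer may appear more than once.
cutoff_rows = [
    (1, datetime(2014, 1, 1, 4, 0)),
    (2, datetime(2014, 1, 1, 5, 0)),
    (1, datetime(2014, 1, 1, 8, 0)),
]


def sum_amount_per_cutoff(rows, cutoffs):
    """Total amount per (customer, cutoff) using only rows up to that cutoff."""
    result = {}
    for customer_id, cutoff in cutoffs:
        result[(customer_id, cutoff)] = sum(
            amount
            for cid, ts, amount in rows
            if cid == customer_id and ts <= cutoff
        )
    return result


totals = sum_amount_per_cutoff(transactions, cutoff_rows)
for key, value in totals.items():
    print(key, value)
```

Customer 1 appears twice with different cutoffs and therefore gets two different totals, one for each point in time.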

In [None]:
cutoff_times = pd.DataFrame()
cutoff_times["customer_id"] = [1, 2, 3, 1]
cutoff_times["time"] = pd.to_datetime(
    ["2014-1-1 04:00", "2014-1-1 05:00", "2014-1-1 06:00", "2014-1-1 08:00"]
)
cutoff_times["label"] = [True, True, False, True]
cutoff_times

Unnamed: 0,customer_id,time,label
0,1,2014-01-01 04:00:00,True
1,2,2014-01-01 05:00:00,True
2,3,2014-01-01 06:00:00,False
3,1,2014-01-01 08:00:00,True


In [None]:
fm, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    cutoff_time=cutoff_times,
    cutoff_time_in_index=True,
)
(
    fm
    .reset_index()
    .sort_values('customer_id')
)

Unnamed: 0,customer_id,time,zip_code,COUNT(sessions),MODE(sessions.device),NUM_UNIQUE(sessions.device),COUNT(transactions),MAX(transactions.amount),MEAN(transactions.amount),MIN(transactions.amount),...,STD(sessions.SUM(transactions.amount)),SUM(sessions.MAX(transactions.amount)),SUM(sessions.MEAN(transactions.amount)),SUM(sessions.MIN(transactions.amount)),SUM(sessions.NUM_UNIQUE(transactions.product_id)),SUM(sessions.SKEW(transactions.amount)),SUM(sessions.STD(transactions.amount)),MODE(transactions.sessions.device),NUM_UNIQUE(transactions.sessions.device),label
0,1,2014-01-01 04:00:00,60091,4,tablet,3,67,139.23,74.002836,5.81,...,271.917637,540.04,304.6017,27.62,20.0,-0.505043,169.572874,tablet,3,True
3,1,2014-01-01 08:00:00,60091,8,mobile,3,126,139.43,71.631905,5.81,...,279.510713,1057.97,582.193117,78.59,40.0,-0.476122,312.745952,mobile,3,True
1,2,2014-01-01 05:00:00,13244,5,desktop,2,62,146.81,83.149355,12.07,...,266.912832,688.14,418.096407,127.06,25.0,-0.269747,190.987775,desktop,2,True
2,3,2014-01-01 06:00:00,13244,4,desktop,2,44,146.31,65.174773,6.65,...,417.557763,493.07,290.968018,126.66,16.0,0.860577,119.136697,desktop,2,False


### Training window

In Featuretools, `training_window` is a parameter that, together with `cutoff_time`, gives finer control over the time period used to generate features in time series settings. While `cutoff_time` defines the upper limit of the period, `training_window` specifies how far back before `cutoff_time` data is considered when generating each feature.

Example:

If cutoff_time is '2024-04-15' and training_window is pd.Timedelta(days=30), Featuretools will use the data between '2024-03-16' and '2024-04-15' to generate the features for that instance.

Accepted time strings:

- 'N days'
- 'N weeks'
- 'N months'
- 'N years'
- 'N hours'
- 'N minutes'
- 'N seconds'
- 'N milliseconds'
- 'N microseconds'
- 'N nanoseconds'


![a](pics/cutoff_time_tw.png)
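
The window arithmetic from the example above can be checked with the standard library alone. The events below are hypothetical, and the boundary handling (start excluded, cutoff included) is an assumption made for this sketch rather than a statement about the library's exact behaviour.

```python
from datetime import datetime, timedelta

cutoff_time = datetime(2024, 4, 15)
training_window = timedelta(days=30)
window_start = cutoff_time - training_window

# Hypothetical events: (timestamp, amount).
events = [
    (datetime(2024, 3, 1), 100.0),   # older than the window, ignored
    (datetime(2024, 3, 20), 20.0),
    (datetime(2024, 4, 10), 30.0),
    (datetime(2024, 4, 20), 999.0),  # after the cutoff, ignored
]

# Assumption: keep rows in the interval (window_start, cutoff_time].
in_window = [amount for ts, amount in events if window_start < ts <= cutoff_time]
window_total = sum(in_window)
print(window_start.date(), window_total)
```

Only the two events inside the 30-day window contribute to the total; both the older event and the one after the cutoff are left out.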

In [None]:
window_fm, window_features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    cutoff_time=cutoff_times,
    cutoff_time_in_index=True,
    training_window="2 hour",
)

window_fm

Unnamed: 0_level_0,Unnamed: 1_level_0,zip_code,COUNT(sessions),MODE(sessions.device),NUM_UNIQUE(sessions.device),COUNT(transactions),MAX(transactions.amount),MEAN(transactions.amount),MIN(transactions.amount),MODE(transactions.product_id),NUM_UNIQUE(transactions.product_id),...,STD(sessions.SUM(transactions.amount)),SUM(sessions.MAX(transactions.amount)),SUM(sessions.MEAN(transactions.amount)),SUM(sessions.MIN(transactions.amount)),SUM(sessions.NUM_UNIQUE(transactions.product_id)),SUM(sessions.SKEW(transactions.amount)),SUM(sessions.STD(transactions.amount)),MODE(transactions.sessions.device),NUM_UNIQUE(transactions.sessions.device),label
customer_id,time,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1,2014-01-01 04:00:00,60091,2,desktop,2,27,139.09,76.95037,5.81,4,5,...,18.667619,271.81,155.6045,12.59,10.0,-0.604638,86.730914,desktop,2,True
2,2014-01-01 05:00:00,13244,3,desktop,2,31,146.81,84.051935,12.07,4,5,...,203.331699,404.04,253.240615,90.35,15.0,-0.110009,109.500185,desktop,2,True
3,2014-01-01 06:00:00,13244,3,desktop,1,29,128.26,66.407586,6.65,1,5,...,477.281339,346.76,228.176684,118.47,11.0,0.242122,71.8719,desktop,1,False
1,2014-01-01 08:00:00,60091,3,mobile,2,47,139.43,66.471277,5.91,4,5,...,330.655558,384.44,198.98475,24.61,15.0,-0.003438,107.128899,mobile,2,True


## 6. Conclusions

- **Pros**:
    - Automates feature engineering.
    - Reduces calculation errors.
    - Handles relational data.
    - Generates time-based features.
    - Flexible creation of custom features.

- **Cons**:
    - Computational efficiency with large volumes of data.

Note: feature creation can be parallelized with the Dask library. The reduction in run time can be significant, approximately 10x.