# Simple FeatureTools demo

**Copyright:**
```
© 2019, Jan Hynek, Vaclav Svoboda, Martin Kotek and HomeCredit International a.s.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License 
- http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

**Disclaimer #1:**

**This notebook is meant as the simplest possible demonstration of FeatureTools for HomeCredit scoring purposes.**

For advanced FeatureTools use, I strongly recommend the official [FeatureTools notebook built on the HomeCredit Kaggle competition data](https://github.com/Featuretools/predict-loan-repayment/blob/master/Automated%20Loan%20Repayment.ipynb).

This notebook is not intended to be run once from beginning to end, but rather à la carte: when you need something (e.g. the creation of custom primitives), just pick that part.

The notebook also links to other useful resources, such as the official documentation.

**Disclaimer #2**:

This is a Jupyter notebook, and Markdown (especially links) does not render correctly on GitLab.
You are advised to download the notebook and open it locally.

**Disclaimer #3**:

Do not hesitate to contact me at jan.hynek@homecredit.eu if you have any questions.


# Setup
In this part, we create the basic setup.<br>

In [None]:
from pprint import pprint

import featuretools as ft
import numpy as np
import pandas as pd
import shap

import xgboost

shap.initjs()

Versions used:

In [None]:
print(
f"""
featuretools: {ft.__version__}
numpy: {np.__version__}
pandas: {pd.__version__}
shap: {shap.__version__}
xgboost: {xgboost.__version__}
"""
)


```
featuretools: 0.8.0
numpy: 1.16.3
pandas: 0.24.2
shap: 0.29.1
xgboost: 0.82

```

**We create the following variables for several reasons:**
- if we want to rename the columns later, it is very easy, because we change the name in one place only (good coding practice)
- Jupyter Notebook can now autocomplete the variable names, so we do not have to type the column names over and over

In [None]:
ID_APPLICATION_COLUMN = "ID_APPLICATION"
ID_TRANSACTIONS_COLUMN = "ID_TRANSACTION"
TIME_TRANSACTION_COLUMN = "TIME"
TIME_APPLICATION_COLUMN = "TIME_APPLICATION"
ENTITY_SET_NAME = "clients"
TRANSACTIONS_ENTITY_NAME = "transactions"
APPLICATIONS_ENTITY_NAME = "applications"

**Select the parts we want to execute:**

In [None]:
# Choose whether you would like to calculate monthly aggregations - see part 3
CALCULATE_MONTHLY_AGGREGATIONS = False

# Choose whether to do automatic segmentation - see part 4
AUTOMATIC_INTERESTING_VALUES = False

# maximal DFS depth - see part 6
MAX_DFS_DEPTH = 1
if CALCULATE_MONTHLY_AGGREGATIONS:
    MAX_DFS_DEPTH = 3

**Data loading:**

In [None]:
# Data loading
RAWDATA_PATH = "demo_data/rawsim2.csv"
# NOTE: the "ANSI" codec alias exists only on Windows; elsewhere, pass the file's actual encoding
transactions = pd.read_csv(RAWDATA_PATH, sep=",", decimal=".", encoding="ANSI")
transactions.head()

## _FeatureTools_ internal structure

_FT_ works with _entities_. We can think of the entities as an internal database structure, <br>
which just captures the relationships between individual datasets.

__Glossary__
- One dataset = one entity
- One resulting dataset = one entity set
- One entity set = multiple entities
- Entity ID = identifier of the dataset within the entity set
- Relationship = mapping between entities


## Single dataset

Even a single dataset can already contain several joined tables.
FeatureTools makes it easy to capture the relationships within such data.

To do that, we just need to tell FeatureTools the names of the individual index columns, in this case `ID_TRANSACTION` and `ID_APPLICATION`.


For an example of good use of the `normalize_entity` function, [see this notebook](https://github.com/Featuretools/predict-appointment-noshow/blob/master/Tutorial.ipynb).

In [None]:
# empty entity set
entity_set = ft.EntitySet(id=ENTITY_SET_NAME)


# we fill the entity_set with the dataframes, and say, which IDs are relevant for given DF
entity_set.entity_from_dataframe(
    entity_id=TRANSACTIONS_ENTITY_NAME,
    dataframe=transactions,
    index=ID_TRANSACTIONS_COLUMN,
    time_index=TIME_TRANSACTION_COLUMN,
)

entity_set.normalize_entity(
    base_entity_id=TRANSACTIONS_ENTITY_NAME,
    new_entity_id=APPLICATIONS_ENTITY_NAME,
    index=ID_APPLICATION_COLUMN,
    make_time_index=False,
)

# entity_set.plot()


## Multiple datasets

However, FeatureTools is stronger when we have multiple datasets.

If we join everything into one huge dataset, it takes a lot of memory, because the same value (e.g. the application time) is repeated on many rows.

We can create the same final dataset by defining the relationships only.

To see how complicated these relationships can be, I recommend [this notebook](https://github.com/Featuretools/predict-olympic-medals/blob/master/PredictOlympicMedals.ipynb), where this [entity set loading function is utilized](https://github.com/Featuretools/predict-olympic-medals/blob/901c4dcbd0ae0aa82994dc4ab80dba44c9e2a35b/utils.py#L115).

In [None]:
# for demonstration purposes only - we divide the dataset in application ids and transactions
# aggregated on the application level
applications = transactions[
    [ID_APPLICATION_COLUMN]
].drop_duplicates()


Let's assume now that we have two independent datasets.

With FeatureTools, we can easily capture the relationships in the data and process the data accordingly.

In [None]:
# empty entity set
entity_set = ft.EntitySet(id=ENTITY_SET_NAME)


# we fill the entity_set with the dataframes, and say, which IDs are relevant for given DF
entity_set.entity_from_dataframe(
    entity_id=TRANSACTIONS_ENTITY_NAME,
    dataframe=transactions,
    index=ID_TRANSACTIONS_COLUMN,
    time_index=TIME_TRANSACTION_COLUMN,
)


# the applications dataframe holds only the application IDs, so no time_index is set
entity_set.entity_from_dataframe(
    entity_id=APPLICATIONS_ENTITY_NAME,
    dataframe=applications,
    index=ID_APPLICATION_COLUMN,
)

# Specification of the relationship between entities
r_transactions_applications = ft.Relationship(
    child_variable=entity_set[TRANSACTIONS_ENTITY_NAME][ID_APPLICATION_COLUMN],
    parent_variable=entity_set[APPLICATIONS_ENTITY_NAME][ID_APPLICATION_COLUMN],
)
entity_set.add_relationship(r_transactions_applications)

Let's check whether FeatureTools identified all columns correctly.

In [None]:
entity_set[TRANSACTIONS_ENTITY_NAME]

# Handling time dimension

FeatureTools has several ways of handling time. The most important terms are _time indices_ and _cutoff times_.

To explain:

- time index: the point in time at which a given row of data became known.
- cutoff time: the point in time at which the data should be cut, as everything after it is future information.

In this case, we set the time index to `TIME_TRANSACTION_COLUMN` and we create the cutoff times from `TIME_APPLICATION_COLUMN`.

This will ensure that:

- features with a time dimension are built from `TIME_TRANSACTION_COLUMN` only
- if correctly paired, all features with a time dimension use only transactions that happened before the time in `TIME_APPLICATION_COLUMN`

More about this (strongly recommended!) is in the [FeatureTools documentation.](https://docs.featuretools.com/automated_feature_engineering/handling_time.html)
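
To make the idea concrete, here is a minimal pandas sketch of what a cutoff time does conceptually. It is not how FeatureTools implements it; the toy data and the inclusive `<=` comparison are assumptions of this sketch.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the constants defined above.
transactions = pd.DataFrame({
    "ID_TRANSACTION": [1, 2, 3, 4],
    "ID_APPLICATION": [10, 10, 20, 20],
    "TIME": pd.to_datetime(["2014-01-05", "2014-02-01", "2014-01-10", "2014-03-01"]),
    "AMOUNT": [100.0, 250.0, 80.0, 40.0],
})
cutoff_times = pd.DataFrame({
    "ID_APPLICATION": [10, 20],
    "TIME_APPLICATION": pd.to_datetime(["2014-01-15", "2014-02-15"]),
})

# Pair every transaction with the cutoff time of its application ...
merged = transactions.merge(cutoff_times, on="ID_APPLICATION")
# ... and keep only rows already known at that time (time index <= cutoff time).
known = merged[merged["TIME"] <= merged["TIME_APPLICATION"]]

# Any aggregation now sees the past only.
max_amount = known.groupby("ID_APPLICATION")["AMOUNT"].max()
print(max_amount)
```

Transactions 2 and 4 happen after their application times, so they do not contribute to the aggregated value.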

## Cutoff times

Cutoff times need only two pieces of information:

- `ID_APPLICATION` - so they can be correctly paired
- `TIME_APPLICATION` - so they can cut the future information off correctly for the given application ID.

_NOTE: An interesting trick with cutoff times is that any extra column added to the cutoff times is simply carried over to the final dataset.
If we have some variables which are already correctly calculated for the final dataset, we can add them here._

In [None]:
cutoff_times = transactions[
    [
        TIME_APPLICATION_COLUMN,
        ID_APPLICATION_COLUMN,
        # variable_1_which_i_would_like_to_add_to_final_data,
        # variable_2_which_i_would_like_to_add_to_final_data, ...
    ]
].drop_duplicates()

# transactions = transactions.drop([
# variable_1_which_i_would_like_to_add_to_final_data,
# variable_2_which_i_would_like_to_add_to_final_data,
# ])


## Intermezzo: Monthly aggregations

What if we would like to **aggregate the data on a monthly level first**, and **then aggregate these monthly aggregates?** (i.e. Slicer functionality).

This part also shows how to **apply cutoff times manually**: we just omit all observations where `MONTH_DIFFERENCE` is negative.

Proposed solution:

- create a column which holds the number of months between `TIME_TRANSACTION_COLUMN` and `TIME_APPLICATION_COLUMN`. Call this column `MONTH_DIFFERENCE`.
- if you want to remove the future, remove the rows with negative `MONTH_DIFFERENCE`.
- create a combined `ID_APPLICATION_MONTH` column by string concatenation of `ID_APPLICATION` and `MONTH_DIFFERENCE`.
- normalize the entity on `ID_APPLICATION_MONTH`.
- normalize the entity on `ID_APPLICATION` as well.

__NOTE:__ Calculation of monthly aggregations needs a DFS depth of 3.
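
The first three steps can be sketched on toy data as follows. The dates and the application ID are made up for illustration; the formula is the same as in `months_between` in the cell below, and the time columns are assumed to be of datetime dtype already.

```python
import pandas as pd

def months_between(d1, d2):
    # months from d2 to d1, rounded up (same formula as in the notebook cell)
    return (
        (d1.dt.year - d2.dt.year) * 12
        + d1.dt.month
        - d2.dt.month
        - (d1.dt.day <= d2.dt.day) * 1
        + 1
    )

# Hypothetical toy data: one application on 2014-03-15 and six transactions.
toy = pd.DataFrame({
    "ID_APPLICATION": [10] * 6,
    "TIME_APPLICATION": pd.to_datetime(["2014-03-15"] * 6),
    "TIME": pd.to_datetime(
        ["2014-03-15", "2014-02-15", "2014-02-10", "2014-01-20", "2014-03-20", "2014-04-20"]
    ),
})

toy["MONTH_DIFFERENCE"] = months_between(toy["TIME_APPLICATION"], toy["TIME"])
print(toy["MONTH_DIFFERENCE"].tolist())  # [0, 1, 2, 2, 0, -1]

# Remove the future, then build the combined key by string concatenation.
kept = toy[toy["MONTH_DIFFERENCE"] >= 0].copy()
kept["ID_APPLICATION_MONTH"] = (
    kept["ID_APPLICATION"].astype(str) + "_" + kept["MONTH_DIFFERENCE"].astype(str)
)
print(kept["ID_APPLICATION_MONTH"].tolist())
```

Note that with this formula the transaction dated 2014-03-20, five days after the application, still gets a difference of 0 and survives the `>= 0` filter. The filter on its own therefore only removes transactions that are more than a calendar month ahead.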

In [None]:
if CALCULATE_MONTHLY_AGGREGATIONS:

    # EXAMPLE CODE: NOT TESTED IN THIS WORKFLOW
    MONTH_DIFFERENCE_COLUMN = "MONTH_DIFFERENCE"
    ID_APPLICATION_MONTH_COLUMN = "ID_APPLICATION_MONTH"
    MONTH_ENTITY_NAME = 'monthly'


    def months_between(d1, d2):
        # months from d2 to d1, rounded up: a started month counts as a full one (the + 1),
        # unless d1's day of month is not later than d2's (the <= term cancels the + 1)
        return (
            (d1.dt.year - d2.dt.year) * 12
            + d1.dt.month
            - d2.dt.month
            - (d1.dt.day <= d2.dt.day) * 1
            + 1
        )


In this part we **show how to get rid of the future**.
We have calculated the number of months between the two dates, and now we omit the rows where this difference is negative, i.e. where the transactions lie in the future.

**IMPORTANT NOTE:** We still have to use cutoff times, even if we get rid of the future. This is because the cutoff time is fed into our custom primitives, so that they know how far back in time they should look.

If we want to calculate all custom primitives from a single time point, we can do it with
```
...
    cutoff_time=pd.Timestamp("2014-1-1 04:00")
...
```
when calculating dfs.

In [None]:
if CALCULATE_MONTHLY_AGGREGATIONS:


    transactions[MONTH_DIFFERENCE_COLUMN] = np.ceil(months_between(
        transactions[TIME_APPLICATION_COLUMN], transactions[TIME_TRANSACTION_COLUMN]
    ))

    transactions = transactions.query(f"{MONTH_DIFFERENCE_COLUMN} >= 0")


**Now we create a new entity set** with two normalised sub-entities: one for applications, and one for the combination of month and application ID.

In [None]:
if CALCULATE_MONTHLY_AGGREGATIONS:
    transactions[ID_APPLICATION_MONTH_COLUMN] = (
        transactions[ID_APPLICATION_COLUMN].astype(str)
        + "_"
        + transactions[MONTH_DIFFERENCE_COLUMN].astype(str)
    )
    
    # empty entity set
    entity_set = ft.EntitySet(id=ENTITY_SET_NAME)


    # we fill the entity_set with the dataframes, and say, which IDs are relevant for given DF
    entity_set.entity_from_dataframe(
        entity_id=TRANSACTIONS_ENTITY_NAME,
        dataframe=transactions,
        index=ID_TRANSACTIONS_COLUMN,
        time_index=TIME_TRANSACTION_COLUMN,
    )

    entity_set.normalize_entity(
        base_entity_id=TRANSACTIONS_ENTITY_NAME,
        new_entity_id=APPLICATIONS_ENTITY_NAME,
        index=ID_APPLICATION_COLUMN,
        make_time_index=False,
    )

    entity_set.normalize_entity(
        base_entity_id=TRANSACTIONS_ENTITY_NAME,
        new_entity_id=MONTH_ENTITY_NAME,
        index=ID_APPLICATION_MONTH_COLUMN,
        make_time_index=False,
    )

entity_set

# Important Values - segmentation

This step is one of the most interesting ones: we tell FeatureTools __by which variables we would like to segment the features.__ <br>

Let's say that we are interested in segmentation by POS/ATM. Apart from that, we would like to know whether there are any differences between Hradec, Praha and Ostrava.
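
Conceptually, a segmented feature is an aggregation computed on a filtered subset of the child rows. The pandas sketch below illustrates this for "maximum amount where the transaction place is POS". The data is made up, and the `"POS"`/`"ATM"` labels are assumed values of `TRANS_PLACE`; this is not the FeatureTools implementation.

```python
import pandas as pd

# Hypothetical toy data; "POS" and "ATM" are assumed values of TRANS_PLACE.
transactions = pd.DataFrame({
    "ID_APPLICATION": [10, 10, 10, 20, 20],
    "TRANS_PLACE": ["POS", "ATM", "POS", "ATM", "ATM"],
    "AMOUNT": [100.0, 300.0, 150.0, 50.0, 70.0],
})

# Keep only the segment of interest, then aggregate per application.
pos_only = transactions[transactions["TRANS_PLACE"] == "POS"]
max_pos = (
    pos_only.groupby("ID_APPLICATION")["AMOUNT"]
    .max()
    # applications without any POS transaction get a missing value
    .reindex(transactions["ID_APPLICATION"].unique())
)
print(max_pos)
```

Application 20 has no POS transaction, so its segmented feature is missing in this sketch.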

In [None]:
print(f' CITY:        {transactions["CITY"].unique().tolist()}')
print(f' TRANS_PLACE: {transactions["TRANS_PLACE"].unique().tolist()}')

To perform the segmentation, we need to set the `interesting_values` attribute.

In [None]:
entity_set[TRANSACTIONS_ENTITY_NAME]["TRANS_PLACE"].interesting_values = (
    transactions["TRANS_PLACE"].unique().tolist()
)
entity_set[TRANSACTIONS_ENTITY_NAME]["CITY"].interesting_values = ["Hradec", "Praha", "Ostrava"]

Or, if we are too lazy to pick only the values which interest us, we can leave everything to FeatureTools. The following command creates segmentations for every variable. We can set the `max_values` argument to limit the number of segments.

In [None]:
if AUTOMATIC_INTERESTING_VALUES:
    entity_set.add_interesting_values(max_values=4)

# Primitives

The basic FT building blocks are primitives. These are the lego pieces from which the individual features are built.

Out of the box, FeatureTools provides more than 70 different feature primitives.


In [None]:
with pd.option_context("display.max_rows", 100):
    display(ft.list_primitives())

## Custom primitives

FeatureTools allows for easy creation of new primitives.

Let's see how a simple primitive is made.

A common use case in HCI is to create an aggregation over some time window.
We will show another way to do these time windows later, but one of the solutions is the creation of custom primitives.

__First, we need to define the function itself.__ This can be a typical pandas aggregation function. <br>

__But we decided to use numpy structures and functions only__, to gain performance, as numpy functions are __often 400x__ faster than the equivalent pandas ones. <br>
These functions are applied very often, and every optimisation counts in the final performance.

Note the reserved argument `time=None`. It is reserved by FeatureTools itself, which fills it in when `cutoff_time` is specified.
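
A minimal numpy-only sketch of the time-window idea, on made-up values: build a boolean mask from the timestamps, then apply a NaN-aware aggregation to the masked values. The variable names here are illustrative and not part of FeatureTools.

```python
import numpy as np

# Hypothetical transaction amounts and their dates.
values = np.array([10.0, np.nan, 30.0, 5.0])
times = np.array(
    ["2014-01-01", "2014-02-20", "2014-03-01", "2014-03-10"], dtype="datetime64[D]"
)

cutoff = np.datetime64("2014-03-15")          # plays the role of the `time` argument
window_start = cutoff - np.timedelta64(30, "D")

mask = times >= window_start                  # True for rows inside the last 30 days
recent = values[mask]

window_max = np.nanmax(recent)                # NaN-aware maximum
window_count = len(recent)                    # plain len() also counts the NaN row
print(window_start, window_max, window_count)
```

The difference between `np.nanmax` and `len` is worth keeping in mind: the former skips missing values, the latter counts every row in the window.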

In [None]:
CURRENT_FUNCTIONS = dict(
    max=np.nanmax,
    min=np.nanmin,
    mean=np.nanmean,
    sum=np.nansum,
    count=len,
    mean_monthly=lambda x: np.nanmean(x) * 30,
)


def agg_last_x_days(values, time_col, days, func, time=None):
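    """
    Aggregate `values` observed within the last `days` days before `time`.

    `func` is a key of CURRENT_FUNCTIONS; `time` is the cutoff time supplied
    by featuretools. Returns None when the aggregation raises ValueError
    (e.g. max of an empty window).
    """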
    data = values[
        time_col.values >= np.datetime64(time - np.timedelta64(days, "D"))
    ].values
    try:
        result = CURRENT_FUNCTIONS[func](data)
    except ValueError:
        result = None
    except KeyError:
        print("Unidentified aggregation function")
        raise
    return result


Next, it is often helpful to provide a custom name-generating function.

If the function is intended to be used with interesting values, its generated name needs to contain the where string (`where_str`). Otherwise the name is ignored (or, to be more precise, it is rewritten).
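
The naming pattern itself is plain string formatting. The standalone sketch below reproduces the template used in the following cell; `window_feature_name` is a hypothetical helper, and the example where string is only illustrative of what FeatureTools might pass in.

```python
def window_feature_name(func, days, child_entity_id, feature, where_str=""):
    """Hypothetical helper: format a feature name such as MAX_30D(entity.column)."""
    return "{func}_{days}D({child_entity_id}.{feature}{where_string})".format(
        func=func.upper(),
        days=str(days),
        child_entity_id=child_entity_id,
        feature=feature,
        where_string=where_str,
    )

# Without segmentation the where string is empty.
plain = window_feature_name("max", 30, "transactions", "AMOUNT")
# With segmentation the where string is appended inside the parentheses.
segmented = window_feature_name(
    "max", 30, "transactions", "AMOUNT", where_str=" WHERE CITY = Praha"
)
print(plain)
print(segmented)
```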

In [None]:
from featuretools.version import __version__

if __version__ == '0.9.0':

    # in 0.9.0 the child entity is read from kwargs["relationship_path_name"]
    def agg_last_x_days_generate_name(self, base_feature_names, **kwargs):
        name = "{func}_{days}D({child_entity_id}.{feature}{where_string})".format(
            func=self.kwargs["func"].upper(),
            days=str(self.kwargs["days"]),
            child_entity_id=kwargs['relationship_path_name'],
            feature=base_feature_names[0],
            where_string=kwargs["where_str"],
        )
        return name

else:

    # other versions pass child_entity_id as a positional argument
    def agg_last_x_days_generate_name(self, child_entity_id, base_feature_names, **kwargs):
        name = "{func}_{days}D({child_entity_id}.{feature}{where_string})".format(
            func=self.kwargs["func"].upper(),
            days=str(self.kwargs["days"]),
            child_entity_id=child_entity_id,
            feature=base_feature_names[0],
            where_string=kwargs["where_str"],
        )
        return name


Finally, FeatureTools provides a function to create custom primitives. We need to specify the input and output types of the function, so that deep feature synthesis can stack the features together.

In [None]:
AGG_LAST_X_DAYS = ft.primitives.make_agg_primitive(
    function=agg_last_x_days,  # function to be used
    input_types=[
        ft.variable_types.Numeric,
        ft.variable_types.DatetimeTimeIndex,
    ],  # input data types
    return_type=ft.variable_types.Numeric,  # data types to be returned
    uses_calc_time=True,  # whether function can utilize cutoff time information
    stack_on_self=False,  # whether the primitive could be stacked on itself
    cls_attributes={
        "generate_name": agg_last_x_days_generate_name
    },  # passing of name generating function
)


We already have some custom primitives prepared:

In [None]:
import sys
sys.path.insert(0, '../')
from scoring.feature_tools.custom_primitives import (
    TIME_SINCE_LAST,
    TIME_SINCE_FIRST,
    AVG_TIME_BETWEEN,
    COUNT_X_DAYS,
)

# Deep Feature Synthesis

The most important part: here we create the features themselves. __We use the building blocks - aggregations and transformations.__ Just like lego. And from these building blocks, we create the final features.

We can also control the complexity of the features using `max_depth`.



## DFS Definition

Now we define the lego blocks which the DFS is going to play with.


- **target entity**, i.e. the level on which the features should be calculated
- **maximum depth of the features** - how many primitives a feature may stack, at most. A rule of thumb is to use 1 for easily explained features, and 2 in case you want to get ratios.

**See the Appendix for a more advanced configuration.**
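
As a rough intuition for depth 1: with `applications` as the target entity, each aggregation primitive corresponds to one group-by aggregation of a child column. The pandas sketch below mimics this on made-up data, with column names formatted in the style FeatureTools uses; it is only an illustration, not what `ft.dfs` runs internally.

```python
import pandas as pd

# Hypothetical toy transactions.
transactions = pd.DataFrame({
    "ID_APPLICATION": [10, 10, 20],
    "AMOUNT": [100.0, 300.0, 50.0],
    "FEE": [1.0, 3.0, 2.0],
})

# One primitive applied to one child column = one depth-1 feature.
features = transactions.groupby("ID_APPLICATION")[["AMOUNT", "FEE"]].agg(["max", "mean"])
features.columns = [
    f"{func.upper()}(transactions.{col})" for col, func in features.columns
]
print(features)
```

Two primitives on two numeric columns give four depth-1 features; a larger `max_depth` lets primitives be applied on top of such features.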

In [None]:
primitives_dfs_definition = dict(
    # entity, for which we would like to calculate individual features
    target_entity=APPLICATIONS_ENTITY_NAME,
    # depth definition
    max_depth=MAX_DFS_DEPTH,

    agg_primitives=[
        "max",
        "mean",
    ],
    # primitives to be used for transformations (1:1 mapping)
    trans_primitives=[
        "days_since",
    ],
    # primitives to be used to aggregate important values (n:1 mapping - same as aggregations)
    where_primitives=[
        "max",
        "median",
    ],
)


## Computation definition


Regarding the computation, we should also specify several technical aspects of `dfs`.
We should specify:
- **entity set**, which has links to the dataframes
- **cutoff_time**, so that our dataset is not spoiled with future information
- whether there are **some variables which we would like to ignore**
- **number of jobs**, for parallelisation. A good number of jobs is an even number, preferably a multiple of the number of available cores.

In [None]:
technical_dfs_definition = dict(
    # relationship specification
    entityset=entity_set,
    
    # parallelisation
    n_jobs=4,
    
    # cutoff times for future omission
    cutoff_time=cutoff_times if not CALCULATE_MONTHLY_AGGREGATIONS else None,
)

## Running the script itself

**And now, this is where the magic happens.**

We perform the deep feature synthesis. For a detailed view of what is happening behind the scenes, see [the original paper](http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf) or [the summary of the key points](https://blog.featurelabs.com/deep-feature-synthesis/).

In [None]:
variables_run1 = ft.dfs(
    # whether feature synthesis should be run or not - the next step
    features_only=True,
    **technical_dfs_definition,
    **primitives_dfs_definition
)

print(f"Number of created variables: {len(variables_run1)}")
pprint(variables_run1)

And we can do **multiple runs** of `dfs`. Notice that in subsequent runs, we can **change the feature depth.**

In [None]:
primitives_dfs_definition_depth3 = dict(
    target_entity=APPLICATIONS_ENTITY_NAME,
    max_depth=3,

    agg_primitives=[
        'min',
        'trend'
    ],
    trans_primitives=[
        "diff"
    ],
    where_primitives=[
        "sum",
        AGG_LAST_X_DAYS(days=30, func="max"),
    ],
)
variables_to_create_depth3 = ft.dfs(
    features_only=True,
    **technical_dfs_definition,
    **primitives_dfs_definition_depth3
)
pprint(variables_to_create_depth3)


Let's choose only some of the variables, here those whose name contains `DIFF`.

In [None]:
variables_run2 = [var for var in variables_to_create_depth3 if 'DIFF' in var.generate_name()]
pprint(variables_run2)

Now we can either select meaningful variables beforehand, or calculate all the features directly.

In [None]:
variables_to_create = variables_run1 + variables_run2

final_dataset = ft.calculate_feature_matrix(
        features=variables_to_create,
        **technical_dfs_definition,
)
    
display(final_dataset.head())

# Feature selection

In the next part, we check whether we found the correct features. To do so, we create an arbitrary target.

## Creation of arbitrary target
In the following part I create an arbitrary target from several arbitrarily chosen relevant features. The maximal fee plays a role, and so does the average spending. It also matters where the person comes from.

In [None]:
# target definition
target_base_1 = transactions.groupby([ID_APPLICATION_COLUMN]).mean()
target_base_2 = transactions.groupby([ID_APPLICATION_COLUMN]).max()
target_base_3 = (
    transactions.groupby([ID_APPLICATION_COLUMN])
    .CITY
    .apply(lambda x: x.mode())
    .reset_index()
    .query("level_1 == 0")
    .set_index(ID_APPLICATION_COLUMN)
)


# defining the target's relationship with the data, to have absolute control over the feature engineering
def standardize(column):
    return (column - column.mean()) / column.std()

# True target = default
target = (
    # mean amount is very important
    (standardize(target_base_1["AMOUNT"]) * -2)
    # maximum fee has the opposite effect
    + (standardize(target_base_2["FEE"]) * 2)
    # living in Praha has a negative effect on default
    + (standardize(target_base_3["CITY"] == "Praha") * -2)
    # living in Ostrava has a positive effect
    + (standardize(target_base_3["CITY"] == "Ostrava") * 2)
)

target = standardize(target) > 0.5  # Approx 30% will be True, otherwise False

In [None]:
sum(target) / len(target)

## XGBoost feature importance (using SHAP values)

In this part we evaluate the importance of the individual features.

First, we do basic preprocessing: we identify categorical features (automatically) and recode them using one-hot encoding.

In [None]:
import shap
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

categorical_features = [
    feat for feat in final_dataset.columns if final_dataset[feat].dtype.kind == "O"
]

for feat in categorical_features:
    final_dataset = pd.concat(
        [final_dataset, pd.get_dummies(final_dataset[feat], prefix=feat)], axis=1
    ).drop(columns=[feat])

For model evaluation, we split the data into three parts: training, validation and test.
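
The idea of a three-way split can be sketched with numpy alone: shuffle the row positions once and cut the shuffled order twice, so that the three parts cannot overlap. `three_way_split` is a hypothetical helper written for this illustration; it does not reproduce the exact rows that `train_test_split` would select.

```python
import numpy as np

def three_way_split(n_rows, test_size=0.2, valid_size=0.3, seed=43):
    """Hypothetical helper: return disjoint train/valid/test row positions."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_rows)

    # First cut: hold out the test part.
    n_test = int(round(test_size * n_rows))
    test_idx, rest = order[:n_test], order[n_test:]

    # Second cut: take the validation part out of what remains.
    n_valid = int(round(valid_size * len(rest)))
    valid_idx, train_idx = rest[:n_valid], rest[n_valid:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = three_way_split(100)
print(len(train_idx), len(valid_idx), len(test_idx))  # 56 24 20
```

The key point is that the second cut is made on the remainder of the first one, not on the full data again.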

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    final_dataset, target, test_size=0.2, random_state=43
)
# split the remaining training data again, so that validation rows never overlap the test set
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.3, random_state=43
)

### Training

Now we train a basic XGBoost classifier. This is very fast in contrast with FeatureTools.

In [None]:
params = {
    'learning_rate': 0.1,
    'max_depth': 5,

    'n_estimators': 1000,
    'early_stopping_rounds': 20,
    
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'n_jobs': 1,
    'random_state': 12345
}

booster_xgb_sklearn = XGBClassifier(**params)

booster_xgb_sklearn.fit(X_train, y_train,
                        eval_set=[(X_train, y_train),
                                  (X_valid, y_valid)],
                        early_stopping_rounds = params['early_stopping_rounds']
                       )

### Feature evaluation

Once the model is trained, we can find the best features.

As we already know the underlying features, we can observe that even though we have not captured the features exactly (e.g. max(transactions.FEE)), a lot of the found features are similar, but segmented.

To evaluate individual features, SHAP values are used. The main idea comes from Shapley values, a term coined in game theory: **they measure how each feature, all other things given, changes the output of the model.**

More about SHAP values can be [found in the original paper](http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf), in this [descriptive article](https://towardsdatascience.com/how-to-avoid-the-machine-learning-blackbox-with-shap-da567fc64a8b), or here [on kaggle](https://www.kaggle.com/dansbecker/shap-values).
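
For intuition, the game-theoretic definition can be computed exactly on a tiny example: a feature's Shapley value is its marginal contribution averaged over all orderings in which the features can be added. The payoffs below are made up, and this brute-force enumeration is only a sketch of the definition; the `shap` library does not enumerate orderings like this.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical "model output" for every subset of two features.
payoffs = {
    frozenset(): 0.0,
    frozenset({"AMOUNT"}): 10.0,
    frozenset({"FEE"}): 20.0,
    frozenset({"AMOUNT", "FEE"}): 40.0,
}

phi = shapley_values(["AMOUNT", "FEE"], payoffs.__getitem__)
print(phi)  # {'AMOUNT': 15.0, 'FEE': 25.0}
```

The two values add up to the output with all features present (40), which is the property that makes such attributions additive.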

In [None]:
feature_names_xgb = booster_xgb_sklearn.get_booster().feature_names
explainer_xgb = shap.TreeExplainer(booster_xgb_sklearn)
shap_values_xgb = explainer_xgb.shap_values(X_train[feature_names_xgb])
shap.summary_plot(shap_values_xgb, X_train[feature_names_xgb], max_display=30, plot_type='bar')

Now we can select a cutoff and use it to choose which variables are the most important for the given model.
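
A minimal sketch of this selection on a made-up SHAP matrix (rows are observations, columns are features): the importance of a feature is the mean of its absolute SHAP values, and the cutoff keeps the features above a chosen threshold. The numbers, the feature names and the threshold of 0.05 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical SHAP values: 3 observations x 3 features.
shap_values = np.array([
    [0.2, -0.01, 0.0],
    [-0.4, 0.02, 0.2],
    [0.3, -0.03, -0.1],
])
feature_names = ["MAX(transactions.FEE)", "MEAN(transactions.AMOUNT)", "CITY_Praha"]

CUTOFF = 0.05  # illustrative threshold

# Mean absolute SHAP value per feature, most important first.
importance = (
    pd.DataFrame(shap_values, columns=feature_names)
    .abs()
    .mean()
    .sort_values(ascending=False)
)
chosen_features = importance.index[importance > CUTOFF].tolist()
print(chosen_features)
```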

In [None]:
# CUTOFF = 0.05

shap_importance_xgb = (
    pd.DataFrame(shap_values_xgb, columns=feature_names_xgb)
    .abs()
    .mean()
    .sort_values(ascending=False)
)
# chosen_features = shap_importance_xgb.index[shap_importance_xgb > CUTOFF].tolist()

display(shap_importance_xgb)

We can also try to find some interactions and visualize the dependence plots as well. See https://slundberg.github.io/shap/notebooks/NHANES%20I%20Survival%20Model.html

# Appendix


One possible definition of the primitives, with the optional entries commented out:

In [None]:

primitives_dfs_definition = dict(
    # entity, for which we would like to calculate individual features
    target_entity=APPLICATIONS_ENTITY_NAME,
    # depth definition
    max_depth=MAX_DFS_DEPTH,
    # output cleaning part
    # variables which will be dropped afterwards
    # drop_contains = [f'({ID_APPLICATION_COLUMN})', f'1 / {ID_APPLICATION_COLUMN}'],
    # variables not to be calculated - ignored in dfs
    # ignore_variables={TRANSACTIONS_ENTITY_NAME: [ID_APPLICATION_COLUMN]},
    # features to be used for aggregations (n:1 mapping)
    agg_primitives=[
        # AGG_LAST_X_DAYS(days=30, func="max"),
        # AGG_LAST_X_DAYS(days=10, func="min"),
        # AGG_LAST_X_DAYS(days=180, func="mean_monthly"),
        "max",
        # "mode",
        # "min",
        "mean",
        # "trend",
    ],
    # primitives to be used for transformations (1:1 mapping)
    trans_primitives=[
        # "week",
        # "month",
        # "year",
        "days_since",
        # "diff",
        # "absolute",
        # "divide_by_feature",
        # "cum_min",
        # "cum_max",
    ],
    # primitives to be used to aggregate important values (n:1 mapping - same as aggregations)
    where_primitives=[
        # "min",
        "max",
        "median",
        # "trend",
        # "sum",
        # AGG_LAST_X_DAYS(days=30, func="max"),
    ],
)