<a href="https://colab.research.google.com/github/virbickt/default-risk-prediction/blob/main/feature_engineering_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature engineering

Judging from the scores we obtained on the dataset without engineered features (a series of experiments left out of this notebook for the sake of presentation), feature engineering is decisive in this competition. What complicates it is that we have almost no domain knowledge. As a result, in addition to the features we came up with ourselves, we draw on the work of others.

## First batch

The first batch of engineered features consists mostly of ratios between features that both make sense domain-wise and rank among the most important features, plus one difference and one row-wise sum.
```python
for df in [WORKING_train, WORKING_test]:
    # annuity, goods price and credit amount relative to one another
    df['LOAN_RATE'] = df['AMT_ANNUITY'] / df['AMT_CREDIT']
    df['*goods_to_loan_rate'] = df['AMT_GOODS_PRICE'] / df['AMT_CREDIT']
    df['*APPS_CREDIT_GOODS_DIFF'] = df['AMT_CREDIT'] - df['AMT_GOODS_PRICE']

    # loan amounts relative to the applicant's income
    df['*APPS_ANNUITY_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
    df['*APPS_CREDIT_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
    df['*APPS_GOODS_INCOME_RATIO'] = df['AMT_GOODS_PRICE'] / df['AMT_INCOME_TOTAL']

    # income per family member
    df['*APPS_CNT_FAM_INCOME_RATIO'] = df['AMT_INCOME_TOTAL'] / df['CNT_FAM_MEMBERS']

    # income and car age relative to employment length and age
    df['*APPS_INCOME_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
    df['*APPS_INCOME_BIRTH_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_BIRTH']
    df['*APPS_CAR_BIRTH_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_BIRTH']
    df['*APPS_CAR_EMPLOYED_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_EMPLOYED']

    # note: identical to '*APPS_CREDIT_INCOME_RATIO' above
    df['*credit_income_ratio'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
    df['*employed_birth_ratio'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']

    # row-wise sum over all AMT_REQ_* columns
    amt_req_cols = [col for col in df.columns if 'AMT_REQ_' in col]
    df['*amt_req_sum'] = df[amt_req_cols].sum(axis=1)
```
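
Ratios like these can turn into infinities when a denominator is zero. In addition, `DAYS_EMPLOYED` in the Home Credit data is commonly reported to use a placeholder value of 365243 for applicants without employment, which would distort the employment-based ratios. The sketch below shows one possible guard against both cases. The helper name `safe_ratio`, the toy values, and the choice to map such rows to `NaN` are assumptions made for illustration and are not part of the pipeline above.

```python
import numpy as np
import pandas as pd

# Assumed sentinel value in DAYS_EMPLOYED; verify against your copy of the data.
DAYS_EMPLOYED_PLACEHOLDER = 365243


def safe_ratio(numerator, denominator):
    """Element-wise ratio of two Series with zero denominators mapped to NaN."""
    denominator = denominator.replace(0, np.nan)
    return numerator / denominator


# Toy data: a regular applicant, a placeholder row and a zero denominator.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [90000.0, 120000.0, 60000.0],
    "DAYS_EMPLOYED": [-1500, 365243, 0],
})

# Treat the placeholder as missing before building the ratio.
days_employed = toy["DAYS_EMPLOYED"].replace(DAYS_EMPLOYED_PLACEHOLDER, np.nan)
toy["INCOME_EMPLOYED_RATIO"] = safe_ratio(toy["AMT_INCOME_TOTAL"], days_employed)
print(toy)
```

Leaving such rows as `NaN` works with gradient-boosting libraries that handle missing values natively; other models would need an imputation step afterwards.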

## Second batch

Since `EXT_SOURCE_1`, `EXT_SOURCE_2` and `EXT_SOURCE_3` rank among the most important features, we compute the following row-wise aggregations across the three:

- product
- weighted sum (with hand-picked weights of 2, 1 and 3)
- minimum
- maximum
- mean
- median (ignoring missing values)
- variance

```python
for df in [WORKING_train, WORKING_test]:
    df["EXT_SOURCES_PROD"] = (
        df["EXT_SOURCE_1"] * df["EXT_SOURCE_2"] * df["EXT_SOURCE_3"]
    )
    df["EXT_SOURCES_WEIGHTED"] = (
        df.EXT_SOURCE_1 * 2 + df.EXT_SOURCE_2 * 1 + df.EXT_SOURCE_3 * 3
    )

    # look up np.min, np.max, ... by name rather than going through eval
    for function_name in ["min", "max", "mean", "nanmedian", "var"]:
        feature_name = "EXT_SOURCES_{}".format(function_name.upper())
        df[feature_name] = getattr(np, function_name)(
            df[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]], axis=1
        )
```
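
If the `EXT_SOURCE` columns still contain missing values at this point (the notebook does not say whether they have been imputed), the aggregations behave differently from one another. The toy example below is a sketch under that assumption: explicit arithmetic such as the product propagates `NaN`, whereas pandas row-wise reductions and `np.nanmedian` skip it. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

ext_cols = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]

# Two toy applicants; the second one is missing EXT_SOURCE_1.
toy = pd.DataFrame({
    "EXT_SOURCE_1": [0.2, np.nan],
    "EXT_SOURCE_2": [0.5, 0.4],
    "EXT_SOURCE_3": [0.8, 0.6],
})

# Explicit multiplication propagates NaN to the result.
prod_explicit = toy["EXT_SOURCE_1"] * toy["EXT_SOURCE_2"] * toy["EXT_SOURCE_3"]

# pandas row-wise reductions skip NaN by default (skipna=True).
row_min = toy[ext_cols].min(axis=1)
row_mean = toy[ext_cols].mean(axis=1)

# np.nanmedian ignores NaN by design and returns a plain NumPy array.
row_median = np.nanmedian(toy[ext_cols], axis=1)

print(prod_explicit.tolist(), row_min.tolist(), row_mean.tolist(), row_median.tolist())
```

In other words, a row with one missing score gets no product or weighted sum, but it still gets a minimum, mean and median computed from the scores that are present.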

## Third batch

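Inspired by [the work of a Kaggle grandmaster](https://github.com/rishabhrao1997/Home-Credit-Default-Risk/blob/main/Feature%20Engineering%20and%20Modelling.ipynb), we fit a `KNeighborsClassifier` with 500 neighbors on the three `EXT_SOURCE` features and `CREDIT_ANNUITY_RATIO`, and use the mean target value of each row's 500 nearest training neighbors as a feature. Building on that, we also add the median, minimum, maximum, product and variance of the neighbors' targets: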

```python
from sklearn.neighbors import KNeighborsClassifier

neighbor_features = [
    "EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "CREDIT_ANNUITY_RATIO"
]

knn = KNeighborsClassifier(500, n_jobs=-1)
train_data_for_neighbors = x_processing[neighbor_features]
test_data_for_neighbors = test_df_processing[neighbor_features]

knn.fit(train_data_for_neighbors, y)

# kneighbors returns (distances, indices); we only need the indices
train_500_neighbors = knn.kneighbors(train_data_for_neighbors)[1]
test_500_neighbors = knn.kneighbors(test_data_for_neighbors)[1]

# aggregate the training targets of each row's 500 nearest training neighbors
train_target = x_processing["TARGET"]
for stat in ["mean", "median", "min", "max", "prod", "var"]:
    feature_name = "TARGET_NEIGHBORS_500_{}".format(stat.upper())
    x_processing[feature_name] = [
        getattr(train_target.iloc[ele], stat)() for ele in train_500_neighbors
    ]
    test_df_processing[feature_name] = [
        getattr(train_target.iloc[ele], stat)() for ele in test_500_neighbors
    ]
```
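
One thing to keep in mind with neighbor-based target features: when the training rows themselves are passed to `kneighbors`, each row is normally returned among its own neighbors, so its own label contributes to its feature. In scikit-learn, calling `kneighbors` without a query matrix returns the neighbors of the fitted points and does not count a point as its own neighbor. The sketch below illustrates this on synthetic data; the data, the variable names and `k = 10` are assumptions chosen to keep the example small, and this is not the procedure used in the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the four neighbor features and the binary target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_test = rng.normal(size=(50, 4))

k = 10  # the notebook uses 500; a small value keeps the toy example fast
knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

# No query matrix: neighbors of the fitted points, each point excluded from its own list.
train_idx = knn.kneighbors(return_distance=False)
# Test rows are not part of the fitted data, so they are queried as usual.
test_idx = knn.kneighbors(X_test, return_distance=False)

# Mean target of each row's neighbors, computed with NumPy fancy indexing.
train_neighbors_mean = y_train[train_idx].mean(axis=1)
test_neighbors_mean = y_train[test_idx].mean(axis=1)

print(train_neighbors_mean[:5], test_neighbors_mean[:5])
```

Indexing the target array with the whole index matrix at once also avoids the Python-level loop over rows, which can matter with several hundred thousand applications.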

## Correlations

Let's see how the engineered features correlate with the target variable.

```python
import pandas as pd
import plotly.graph_objects as go

engineered = [
    "TARGET_NEIGHBORS_500_MEAN",
    "TARGET_NEIGHBORS_500_MEDIAN",
    "TARGET_NEIGHBORS_500_MIN",
    "TARGET_NEIGHBORS_500_MAX",
    "TARGET_NEIGHBORS_500_PROD",
    "TARGET_NEIGHBORS_500_VAR",
    "EXT_SOURCES_PROD",
    "EXT_SOURCES_WEIGHTED",
    "EXT_SOURCES_MIN",
    "EXT_SOURCES_MAX",
    "EXT_SOURCES_MEAN",
    "EXT_SOURCES_NANMEDIAN",
    "EXT_SOURCES_VAR",
    "LOAN_RATE",
    "*goods_to_loan_rate",
    "*APPS_CREDIT_GOODS_DIFF",
    "*APPS_ANNUITY_INCOME_RATIO",
    "*APPS_CREDIT_INCOME_RATIO",
    "*APPS_GOODS_INCOME_RATIO",
    "*APPS_CNT_FAM_INCOME_RATIO",
    "*APPS_INCOME_EMPLOYED_RATIO",
    "*APPS_INCOME_BIRTH_RATIO",
    "*APPS_CAR_BIRTH_RATIO",
    "*APPS_CAR_EMPLOYED_RATIO",
    "*credit_income_ratio",
    "*employed_birth_ratio",
    "*amt_req_sum",
]

# load the imputed and encoded version of the data (extended with data from auxiliary tables)
INCLUSIVE_train = pd.read_csv("/content/drive/MyDrive/341/2ALL_INCLUDED_train.csv")
INCLUSIVE_train['TARGET'] = app_train.TARGET

# Pearson correlation of each engineered feature with the target
engineered_corrs = INCLUSIVE_train.corr()["TARGET"].loc[engineered]

fig = go.Figure(go.Bar(
    x=engineered_corrs.values.tolist(),
    y=engineered_corrs.index.tolist(),
    orientation='h'))

fig.update_layout(
    autosize=False,
    width=1000,
    height=700,
    xaxis_title="Pearson correlation",
    yaxis_title="Engineered features",
)

fig.show()
```
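
When reading such a chart, the sign matters less than the magnitude: a strongly negative correlation is as informative as a strongly positive one. As a minimal sketch, assuming only a frame with a `TARGET` column and a list of feature names, `DataFrame.corrwith` computes the correlations for the selected columns without building the full correlation matrix, and the result can then be ranked by absolute value. The synthetic frame `toy` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tied to the target and one pure-noise feature.
rng = np.random.default_rng(0)
n = 1000
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "TARGET": target,
    "informative": target + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
})

feature_cols = ["informative", "noise"]

# Pearson correlation of each selected column with the target.
corrs = toy[feature_cols].corrwith(toy["TARGET"])

# Order the features by absolute correlation, strongest first, keeping the sign.
ranked = corrs.reindex(corrs.abs().sort_values(ascending=False).index)
print(ranked)
```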