## Feature Engineering

&emsp;&emsp;This section creates generic composite features and business statistical features. Once the features are created, they are filtered with a filter method based on the Pearson correlation coefficient.

### 1. Demo for explanation: Generic composite feature creation

&emsp;&emsp;These features are created by pairing every value of each discrete feature with each continuous feature, and then summing the continuous feature for each pairing within every card_id.

<center><img src="https://i.loli.net/2021/10/23/KUTtb32o9eS1MJW.png" alt="image-20211023153800138" style="zoom:67%;" /></center>

The data set created this way describes the consumption behavior of each card_id along as many dimensions as possible. Because it has one row per card_id, it can also be joined directly to the training set and test set and used for modeling.

In [1]:
import gc
import time
import numpy as np
import pandas as pd
from datetime import datetime

In [13]:
# Create a DataFrame from a dictionary
d1 = {'card_id': [1, 2, 1, 3],
      'A': [1, 2, 1, 2],
      'B': [2, 1, 2, 2],
      'C': [4, 5, 1, 5],
      'D': [7, 5, 4, 8]}

# Build the frame from d1 (the original passed the not-yet-defined t1)
t1 = pd.DataFrame(d1)
t1

|   | card_id | A | B | C | D |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 4 | 7 |
| 1 | 2 | 2 | 1 | 5 | 5 |
| 2 | 1 | 1 | 2 | 1 | 4 |
| 3 | 3 | 2 | 2 | 5 | 8 |


In [10]:
# Label the feature types (continuous vs. discrete)
numeric_cols = ['C', 'D']
category_cols = ['A', 'B']

In [11]:
# Create a dictionary with id as key and empty dictionary as value
features = {}
card_all = t1['card_id'].values.tolist()
for card in card_all:
    features[card] = {}

In [12]:
features

{1: {}, 2: {}, 3: {}}

In [18]:
# list of all field names
columns = t1.columns.tolist()
columns

['card_id', 'A', 'B', 'C', 'D']

In [19]:
# the index value of card_id in the list
idx = columns.index('card_id')
idx

0

In [31]:
# Index value of discrete field
category_cols_index = [columns.index(col) for col in category_cols]
category_cols_index

[1, 2]

In [32]:
# Index value of continuous field
numeric_cols_index = [columns.index(col) for col in numeric_cols]
numeric_cols_index

[3, 4]

In [25]:
# Pair every value of each discrete field with each continuous field,
# and accumulate the sum per card_id at the same time
for i in range(t1.shape[0]):
    va = t1.loc[i].values
    card = va[idx]
    for cate_ind in category_cols_index:
        for num_ind in numeric_cols_index:
            col_name = '&'.join([columns[cate_ind], str(va[cate_ind]), columns[num_ind]])
            features[card][col_name] = features[card].get(col_name, 0) + va[num_ind]

&emsp;&emsp;Then view the final contents of features:

In [26]:
features

{1: {'A&1&C': 5, 'A&1&D': 11, 'B&2&C': 5, 'B&2&D': 11},
 2: {'A&2&C': 5, 'A&2&D': 5, 'B&1&C': 5, 'B&1&D': 5},
 3: {'A&2&C': 5, 'A&2&D': 8, 'B&2&C': 5, 'B&2&D': 8}}

At this point, features holds the per-card_id sums of the new features obtained by combining each value of the discrete variables with each continuous variable. Next we convert it to a DataFrame:

In [33]:
# convert to df
df = pd.DataFrame(features).T.reset_index()

# List all column names
cols = df.columns.tolist()

# Rename the first column (the former index) to card_id
df.columns = ['card_id'] + cols[1:]
df

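|   | card_id | A&1&C | A&1&D | B&2&C | B&2&D | A&2&C | A&2&D | B&1&C | B&1&D |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5.0 | 11.0 | 5.0 | 11.0 | NaN | NaN | NaN | NaN |
| 1 | 2 | NaN | NaN | NaN | NaN | 5.0 | 5.0 | 5.0 | 5.0 |
| 2 | 3 | NaN | NaN | 5.0 | 8.0 | 5.0 | 8.0 | NaN | NaN |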


&emsp;&emsp;This way of creating features can expose a lot of information hidden in the data set, but it tends to produce many null values, so the sparsity of the feature matrix has to be taken into account in the subsequent modeling process.
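To make the sparsity concrete, the share of null cells can be measured directly. The snippet below is a minimal sketch that rebuilds the demo result shown above by hand; the variable names `null_ratio_per_col` and `overall_null_ratio` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the demo output (NaN = combination never observed for that card)
df = pd.DataFrame({
    'card_id': [1, 2, 3],
    'A&1&C': [5.0, np.nan, np.nan],
    'A&1&D': [11.0, np.nan, np.nan],
    'B&2&C': [5.0, np.nan, 5.0],
    'B&2&D': [11.0, np.nan, 8.0],
    'A&2&C': [np.nan, 5.0, 5.0],
    'A&2&D': [np.nan, 5.0, 8.0],
    'B&1&C': [np.nan, 5.0, np.nan],
    'B&1&D': [np.nan, 5.0, np.nan],
})

feature_cols = [c for c in df.columns if c != 'card_id']

# Share of null cells in each feature column
null_ratio_per_col = df[feature_cols].isnull().mean()

# Share of null cells over the whole feature matrix
overall_null_ratio = df[feature_cols].isnull().values.mean()

print(null_ratio_per_col)
print(overall_null_ratio)
```

On a real transaction table the same two lines give a quick indication of how sparse the composite features are before they are passed to a model.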

#### 1.2 Create general composite features based on transaction datasets

&emsp;&emsp;The transaction data read here is the previously created transaction_d_pre.csv data set.

In [65]:
train = pd.read_csv('preprocess/train_pre.csv')
test =  pd.read_csv('preprocess/test_pre.csv')
transaction = pd.read_csv('preprocess/transaction_d_pre.csv')

- Field type annotation

In [34]:
# Label discrete or continuous fields
numeric_cols = ['purchase_amount', 'installments']

category_cols = ['authorized_flag', 'city_id', 'category_1',
                 'category_3', 'merchant_category_id', 'month_lag',
                 'most_recent_sales_range', 'most_recent_purchases_range',
                 'category_4', 'purchase_month', 'purchase_hour_section',
                 'purchase_day']

id_cols = ['card_id', 'merchant_id']

- Feature creation

In [7]:
# Create a dictionary for storing data
features = {}
# Series.append was removed in pandas 2.0; pd.concat is the equivalent call
card_all = pd.concat([train['card_id'], test['card_id']]).values.tolist()
for card in card_all:
    features[card] = {}
     
# Indexes that mark fields of different types
columns = transaction.columns.tolist()
idx = columns.index('card_id')
category_cols_index = [columns.index(col) for col in category_cols]
numeric_cols_index = [columns.index(col) for col in numeric_cols]

# record running time
s = time.time()
num = 0

# Execute the loop, and record the time during the process
for i in range(transaction.shape[0]):
    va = transaction.loc[i].values
    card = va[idx]
    for cate_ind in category_cols_index:
        for num_ind in numeric_cols_index:
            # str() keeps join from failing on non-string category values, as in the demo
            col_name = '&'.join([columns[cate_ind], str(va[cate_ind]), columns[num_ind]])
            features[card][col_name] = features[card].get(col_name, 0) + va[num_ind]
    num += 1
    if num%1000000==0:
        print(time.time()-s, "s")
del transaction
gc.collect()

142.746732711792 s
241.50783610343933 s
338.9149408340454 s
436.4667372703552 s
533.113107919693 s
629.66761469841 s
727.1969571113586 s
824.3946213722229 s
921.0717754364014 s
1017.7034878730774 s
1114.4361855983734 s
1211.2046930789948 s
1308.0264575481415 s
1404.8067374229431 s
1501.533932209015 s
1598.396145105362 s
1695.2529389858246 s
1792.5994687080383 s
1889.7299542427063 s
1987.0093190670013 s
2084.849946975708 s
2183.5546836853027 s
2281.704159259796 s
2379.819750070572 s
2478.2387039661407 s
2576.8626248836517 s
2676.383053302765 s
2777.3995122909546 s
2879.5466351509094 s
2981.8099772930145 s
3085.8226635456085 s


0
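The log above shows roughly 100 s per million rows for the row-by-row loop. As a possible alternative, the same kind of features can be built with grouped sums. The sketch below is not the notebook's method: `composite_features` is a hypothetical helper, it is shown on the toy data from the demo, and it was not timed on the real transaction table. It relies on the default `groupby` behavior of dropping rows whose category value is null.

```python
import pandas as pd


def composite_features(transaction, category_cols, numeric_cols, id_col='card_id'):
    """Hypothetical helper: sum each numeric column per (id, category value)
    and return one wide row per id with '<cat>&<value>&<num>' column names."""
    parts = []
    for cate in category_cols:
        # One grouped sum per categorical column covers all numeric columns
        grouped = transaction.groupby([id_col, cate])[numeric_cols].sum()
        # Move the category values from the row index into the columns;
        # combinations never observed for an id become NaN
        wide = grouped.unstack(cate)
        # Flatten the (numeric, value) column pairs into a single name
        wide.columns = ['&'.join([cate, str(val), num]) for num, val in wide.columns]
        parts.append(wide)
    # Align all blocks on the id index and turn the id back into a column
    return pd.concat(parts, axis=1).reset_index()


# Same toy data as the demo in section 1
t1 = pd.DataFrame({'card_id': [1, 2, 1, 3],
                   'A': [1, 2, 1, 2],
                   'B': [2, 1, 2, 2],
                   'C': [4, 5, 1, 5],
                   'D': [7, 5, 4, 8]})

result = composite_features(t1, ['A', 'B'], ['C', 'D'])
print(result)
```

Cards that have no transactions do not appear in `result`; after a left merge on card_id they would simply get null values, as they do with the dictionary approach.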

&emsp;&emsp;After extracting the features, the next step is to merge the transaction data features into the training set and test set:

In [8]:
# Dictionary to dataframe
df = pd.DataFrame(features).T.reset_index()
del features
cols = df.columns.tolist()
df.columns = ['card_id'] + cols[1:]

# Generate training and test sets
train = pd.merge(train, df, how='left', on='card_id')
test =  pd.merge(test, df, how='left', on='card_id')
del df
train.to_csv("preprocess/train_dict.csv", index=False)
test.to_csv("preprocess/test_dict.csv", index=False)

gc.collect()

&emsp;&emsp;Take a quick look at the resulting data set:

<center><img src="https://i.loli.net/2021/10/23/ZY75eSk3pAayoJn.png" alt="image-20211023161451438" style="zoom:67%;" /></center>

### 2. Create business statistical features

&emsp;&emsp;The second method first groups the transactions by card_id, then computes statistics of the different fields within each group, and uses those statistics as features for modeling. The basic structure of the resulting features is as follows:

<center><img src="https://i.loli.net/2021/10/23/NupDc9JnBbHRPgU.png" alt="image-20211023162730619" style="zoom:80%;" /></center>

The features constructed this way do not contain a large number of missing values, and relatively few new columns are created.
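As a minimal illustration of the idea, the sketch below applies a grouped aggregation to a small toy frame. The frame `toy` and its values are made up for illustration; only the column names `card_id`, `purchase_amount` and `merchant_id` come from the data set used in this section.

```python
import pandas as pd

# Toy transactions: card 1 has two rows, cards 2 and 3 have one row each
toy = pd.DataFrame({
    'card_id': [1, 2, 1, 3],
    'purchase_amount': [4.0, 5.0, 1.0, 5.0],
    'merchant_id': ['m1', 'm2', 'm3', 'm2'],
})

# Statistics to compute for each field
aggs = {
    'purchase_amount': ['mean', 'max', 'sum'],
    'merchant_id': ['nunique'],
}

stats = toy.groupby('card_id').agg(aggs)

# Flatten the two-level column index into names such as 'purchase_amount_mean'
stats.columns = ['_'.join(pair) for pair in stats.columns]
stats = stats.reset_index()
print(stats)
```

Every card_id gets exactly one value per statistic, which is why this construction adds a fixed, small number of columns and leaves no gaps for cards that have at least one transaction.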

In [36]:
transaction = pd.read_csv('preprocess/transaction_g_pre.csv')

- Field type annotation

In [34]:
# Label discrete or continuous fields
# 'purchase_month' and 'purchase_day' were listed twice; the duplicates are removed
# (the resulting aggs dictionary and column order are unchanged)
numeric_cols = ['authorized_flag', 'category_1', 'installments',
                'category_3', 'month_lag', 'purchase_month', 'purchase_day',
                'purchase_day_diff', 'purchase_month_diff',
                'purchase_amount', 'category_2', 'purchase_hour_section',
                'most_recent_sales_range', 'most_recent_purchases_range', 'category_4']
categorical_cols = ['city_id', 'merchant_category_id', 'merchant_id', 'state_id', 'subsector_id']

- Feature extraction process

In [7]:
# create empty dictionary
aggs = {}

# Statistics to compute for continuous and discrete fields
for col in numeric_cols:
    aggs[col] = ['nunique', 'mean', 'min', 'max','var','skew', 'sum']
for col in categorical_cols:
    aggs[col] = ['nunique']    
aggs['card_id'] = ['size', 'count']
cols = ['card_id']

# Build flat column names of the form '<field>_<statistic>', in the same order as aggs
for key in aggs.keys():
    cols.extend([key+'_'+stat for stat in aggs[key]])

# Statistics over transactions with month_lag < 0, suffixed '_hist'
df = transaction[transaction['month_lag']<0].groupby('card_id').agg(aggs).reset_index()
df.columns = cols[:1] + [co+'_hist' for co in cols[1:]]

# Statistics over transactions with month_lag >= 0, suffixed '_new'
df2 = transaction[transaction['month_lag']>=0].groupby('card_id').agg(aggs).reset_index()
df2.columns = cols[:1] + [co+'_new' for co in cols[1:]]
df = pd.merge(df, df2, how='left',on='card_id')

# Statistics over all transactions, without a suffix
df2 = transaction.groupby('card_id').agg(aggs).reset_index()
df2.columns = cols
df = pd.merge(df, df2, how='left',on='card_id')
del transaction
gc.collect()

# Generate training and test sets
train = pd.merge(train, df, how='left', on='card_id')
test =  pd.merge(test, df, how='left', on='card_id')
del df
train.to_csv("preprocess/train_groupby.csv", index=False)
test.to_csv("preprocess/test_groupby.csv", index=False)

gc.collect()

Take a quick look at the resulting data set:

<center><img src="https://i.loli.net/2021/10/23/HpI1QuM6ZvtkS7f.png" alt="image-20211023162707542" style="zoom:67%;" /></center>

### 3. Data merge

The two feature sets can only be used for modeling once they are merged. The merge is simple: remove from train_groupby (test_groupby) the columns that already exist in train_dict (test_dict), then join the two horizontally on card_id.

In [2]:
train_dict = pd.read_csv("preprocess/train_dict.csv")
test_dict = pd.read_csv("preprocess/test_dict.csv")
train_groupby = pd.read_csv("preprocess/train_groupby.csv")
test_groupby = pd.read_csv("preprocess/test_groupby.csv")

- Remove duplicate columns

In [3]:
for co in train_dict.columns:
    if co in train_groupby.columns and co!='card_id':
        del train_groupby[co]
for co in test_dict.columns:
    if co in test_groupby.columns and co!='card_id':
        del test_groupby[co]

- Splicing features

In [4]:
train = pd.merge(train_dict, train_groupby, how='left', on='card_id').fillna(0)
test = pd.merge(test_dict, test_groupby, how='left', on='card_id').fillna(0)

> The operation above fills the missing values with 0. These are not genuinely missing values: they are combinations for which no statistic was produced during feature creation, so logically they are 0. Filling them with 0 is therefore equivalent to completing the data.

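The merge and fill steps above can be illustrated on toy data. This is a minimal sketch: the two small frames and the `target` column are hypothetical stand-ins for the real files, chosen so that one card has no group statistics and one composite feature is never observed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_dict: composite feature 'A&1&C' exists only for card 1
train_dict = pd.DataFrame({'card_id': [1, 2, 3],
                           'target': [0.1, 0.2, 0.3],
                           'A&1&C': [5.0, np.nan, np.nan]})

# Toy stand-in for train_groupby: card 3 has no statistics row
train_groupby = pd.DataFrame({'card_id': [1, 2],
                              'target': [0.1, 0.2],
                              'purchase_amount_sum': [5.0, 5.0]})

# Keep card_id plus the columns that train_dict does not already contain
keep = ['card_id'] + [c for c in train_groupby.columns
                      if c not in train_dict.columns]

# Left join on card_id, then replace the nulls with 0
merged = pd.merge(train_dict, train_groupby[keep], how='left', on='card_id')
filled = merged.fillna(0)
print(filled)
```

Here `merged` still contains nulls for the unseen combination and for the card without statistics, and `filled` replaces all of them with 0, matching the reasoning in the note above.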
- Data storage and memory management

In [None]:
train.to_csv("preprocess/train.csv", index=False)
test.to_csv("preprocess/test.csv", index=False)

del train_dict, test_dict, train_groupby, test_groupby
gc.collect()