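# Introduction
In this notebook, we explore creating features by hand for the Home Credit Default Risk competition. In the previous notebook, we built models using only the loan application data; the best of them scored about 0.74 on the leaderboard. To score higher, we need to use the information in the other data tables. Here, we use the bureau and bureau_balance tables, which contain the following:
* bureau: the client's previous loans with other financial institutions
* bureau_balance: monthly records for the client's previous loans with other financial institutions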

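Manual feature engineering is a tedious process that usually relies on domain expertise. Since my domain knowledge of credit is limited, I will put as much information as possible into the final training table. The idea is that the model will work out which features are important and which are not, rather than us having to decide. In short, our approach is to create as many features as we can and then hand them all to the model. Afterwards, we can reduce the number of features using the model's feature importances or other techniques such as PCA.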

In [None]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

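# Counts of a client's previous loans
To illustrate the general process of manual feature engineering, we start by simply counting how many previous loans each client has had with other financial institutions.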

In [None]:
# Read in bureau
bureau = pd.read_csv('../input/bureau.csv')
bureau.head()

In [None]:
# Groupby the client id (SK_ID_CURR), count the number of previous loans, and rename the column
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})
previous_loan_counts.head()

In [None]:
# Join to the training dataframe
train = pd.read_csv('../input/application_train.csv')
train = train.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')

# Fill the missing values with 0 
train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()
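The count, left join, and fill steps above can be sketched on a tiny example. The frames `toy_bureau` and `toy_train` below are hypothetical stand-ins invented for illustration, not the competition data.

```python
import pandas as pd

# Hypothetical toy stand-ins for bureau and application_train
toy_bureau = pd.DataFrame({
    'SK_ID_CURR':   [1, 1, 1, 2, 4],
    'SK_ID_BUREAU': [10, 11, 12, 13, 14],
})
toy_train = pd.DataFrame({'SK_ID_CURR': [1, 2, 3, 4], 'TARGET': [0, 1, 0, 0]})

# One row per client with the number of previous loans
toy_counts = (toy_bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU']
              .count()
              .rename(columns={'SK_ID_BUREAU': 'previous_loan_counts'}))

# The left join keeps every applicant; clients absent from bureau get NaN
toy_merged = toy_train.merge(toy_counts, on='SK_ID_CURR', how='left')
n_missing = toy_merged['previous_loan_counts'].isna().sum()

# A missing count means "no previous loans on record", so fill with 0
toy_merged['previous_loan_counts'] = toy_merged['previous_loan_counts'].fillna(0).astype(int)
print(toy_merged)
```

Client 3 never appears in the toy bureau table, which is why the fill step is needed after a left join.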

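# Assessing the usefulness of a new variable with the r value
To judge whether a new variable is useful, we can calculate the Pearson correlation coefficient (r value) between the variable and the target. The r value measures the strength of the linear relationship between two variables. It is not the best measure of a variable's usefulness, but it gives a first approximation of whether the variable will help a machine learning model. The larger the absolute r value, the more likely the new variable is to be related to the target.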
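To make the r value concrete, the sketch below computes it from its definition on a hypothetical six-row frame and compares the result with pandas' `Series.corr`, which uses the Pearson method by default. The data are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric feature and a binary target
toy = pd.DataFrame({
    'feature': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'TARGET':  [0, 0, 1, 0, 1, 1],
})

x = toy['feature'].to_numpy()
y = toy['TARGET'].to_numpy()

# Pearson r: sum of co-deviations divided by the product of the deviation norms
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same quantity from pandas
r_pandas = toy['TARGET'].corr(toy['feature'])
print(round(r_manual, 4), round(r_pandas, 4))
```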

We can also use a kernel density estimate (KDE) plot to inspect the relationship between a variable and the target visually.

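## Kernel density estimate plots
A kernel density estimate plot shows the distribution of a single variable (think of it as a smoothed histogram). To see how the distribution of a variable differs across target values, we can color the distributions by the value of the target.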

In [None]:
# Plots the distribution of a variable colored by value of the target
def kde_target(var_name, df):
    
    # Calculate the correlation coefficient between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])
    
    # Calculate medians for repaid vs not repaid (.loc replaces the removed .ix indexer)
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()
    
    plt.figure(figsize = (12, 6))
    
    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')
    
    # label the plot
    plt.xlabel(var_name)
    plt.ylabel('Density')
    plt.title('%s Distribution' % var_name)
    plt.legend()
    
    # print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # Print out average values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % avg_repaid)

We can test this function using the EXT_SOURCE_3 variable, which we found in the previous notebook to be one of the most important variables.

In [None]:
kde_target('EXT_SOURCE_3', train)

In [None]:
kde_target('previous_loan_counts', train)

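From this it is hard to tell whether the variable is important. The correlation coefficient is extremely weak and the distributions barely differ.

Next, we create more variables from the bureau table by computing the count, mean, max, min, and sum of every numeric column.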

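# Aggregating numeric columns
To use the numeric information in the bureau table, we compute statistics for all of its numeric columns. We group by the client id, aggregate the grouped data, and merge the result into the training table.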

In [None]:
# Group by the client id and calculate aggregation statistics for the numeric columns only
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).select_dtypes('number').groupby('SK_ID_CURR').agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()
bureau_agg.head()

We need to rename these columns.

In [None]:
# List of column names
columns = ['SK_ID_CURR']

# Iterate through the variables names
for var in bureau_agg.columns.levels[0]:
    # Skip the id name
    if var != 'SK_ID_CURR':
        
        # Iterate through the stat names
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat))

In [None]:
bureau_agg.columns = columns
bureau_agg.head()
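A more compact way to build the same kind of names is to iterate over the column labels themselves, since each label of the aggregated frame is a (variable, statistic) tuple. This is a sketch on a hypothetical toy frame, not a replacement for the loop above.

```python
import pandas as pd

# Hypothetical toy stand-in for bureau
toy_bureau = pd.DataFrame({
    'SK_ID_CURR':     [1, 1, 2],
    'DAYS_CREDIT':    [-100, -300, -50],
    'AMT_CREDIT_SUM': [1000.0, 3000.0, 500.0],
})

toy_agg = toy_bureau.groupby('SK_ID_CURR').agg(['mean', 'max'])

# Each column label is a (variable, statistic) tuple; join it into one flat name
toy_agg.columns = ['bureau_%s_%s' % (var, stat) for var, stat in toy_agg.columns]

# Bring the client id back as an ordinary column
toy_agg = toy_agg.reset_index()
print(toy_agg.columns.tolist())
```

Because the names are derived from the actual column labels, they cannot drift out of step with the column order.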

In [None]:
# Merge with the training data
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
train.head()

## Correlation of aggregated values with the target

In [None]:
# List of new correlations
new_corrs = []

# Iterate through the columns 
for col in columns:
    # Calculate correlation with the target
    corr = train['TARGET'].corr(train[col])
    
    # Append the list as a tuple
    new_corrs.append((col, corr))

In [None]:
# Sort the correlations by the absolute value
# Make sure to reverse to put the largest values at the front of list
new_corrs = sorted(new_corrs, key = lambda x: abs(x[1]), reverse = True)
new_corrs[:15]

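None of the new variables has a significant correlation with the TARGET. We can look at the KDE plot of the variable with the highest correlation, bureau_DAYS_CREDIT_mean.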

In [None]:
kde_target('bureau_DAYS_CREDIT_mean', train)

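This column is defined as the number of days before the current Home Credit application that the client applied for credit at another institution. A larger negative number therefore means the earlier loan was taken out longer before the current application. We see an extremely weak positive relationship between the average of this variable and the target, which suggests that clients who applied for loans further in the past may be more likely to repay loans at Home Credit. With a correlation this weak, though, it may just as well be noise.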

# Function for numeric aggregations


In [None]:
def agg_numeric(df, group_var, df_name):
    """Aggregates the numeric values in a dataframe. This can
    be used to create features for each instance of the grouping variable.
    
    Parameters
    --------
        df (dataframe): 
            the dataframe to calculate the statistics on
        group_var (string): 
            the variable by which to group df
        df_name (string): 
            the variable used to rename the columns
        
    Return
    --------
        agg (dataframe): 
            a dataframe with the statistics aggregated for 
            all numeric columns. Each instance of the grouping variable will have 
            the statistics (count, mean, max, min, sum) calculated. 
            The columns are also renamed to keep track of features created.
    
    """
    # Remove id variables other than grouping variable
    for col in df:
        if col != group_var and 'SK_ID' in col:
            df = df.drop(columns = col)
            
    group_ids = df[group_var]
    # Copy so that adding the grouping column does not write into a view of df
    numeric_df = df.select_dtypes('number').copy()
    numeric_df[group_var] = group_ids

    # Group by the specified variable and calculate the statistics
    agg = numeric_df.groupby(group_var).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()

    # Need to create new column names
    columns = [group_var]

    # Iterate through the variables names
    for var in agg.columns.levels[0]:
        # Skip the grouping variable
        if var != group_var:
            # Iterate through the stat names
            for stat in agg.columns.levels[1][:-1]:
                # Make a new column name for the variable and stat
                columns.append('%s_%s_%s' % (df_name, var, stat))

    agg.columns = columns
    return agg

In [None]:
bureau_agg_new = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_agg_new.head()

To make sure the function works as intended, we should compare its output with the aggregated dataframe we built by hand.

In [None]:
bureau_agg.head()
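Comparing `head()` output by eye only covers the first five rows. A stricter check, sketched here on hypothetical toy frames, is `pd.testing.assert_frame_equal`, which raises an `AssertionError` describing the first difference it finds. The two construction routes below are illustrative assumptions, not the notebook's own functions.

```python
import pandas as pd

# Hypothetical toy stand-in for bureau
toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 2], 'DAYS_CREDIT': [-100, -300, -50]})

# Route 1: aggregate with a list of statistics, then flatten the names
manual = toy.groupby('SK_ID_CURR').agg(['mean', 'max'])
manual.columns = ['bureau_%s_%s' % (var, stat) for var, stat in manual.columns]
manual = manual.reset_index()

# Route 2: named aggregation, which produces the flat names directly
from_function = toy.groupby('SK_ID_CURR', as_index=False).agg(
    bureau_DAYS_CREDIT_mean=('DAYS_CREDIT', 'mean'),
    bureau_DAYS_CREDIT_max=('DAYS_CREDIT', 'max'),
)

# Passes silently when values, column names, and dtypes all agree
pd.testing.assert_frame_equal(manual, from_function)
```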

# Function for correlation coefficients

In [None]:
# Function to calculate correlations with the target for a dataframe
def target_corrs(df):

    # List of correlations
    corrs = []

    # Iterate through the columns 
    for col in df.columns:
        # Skip the target column
        if col != 'TARGET':
            # Calculate correlation with the target
            corr = df['TARGET'].corr(df[col])

            # Append the list as a tuple
            corrs.append((col, corr))
            
    # Sort by absolute magnitude of correlations
    corrs = sorted(corrs, key = lambda x: abs(x[1]), reverse = True)
    
    return corrs

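# Categorical variables
For discrete string variables we cannot calculate statistics such as the mean and max, which only apply to numeric variables. Instead, we count the occurrences of each category within each categorical variable. For example, suppose we have the following table: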

SK_ID_CURR | Loan type
-|-
1 | home
1 | home
1 | home
1 | credit
2 | credit
3 | credit
3 | cash
3 | cash
4 | credit
4 | home
4 | home

The number of loans of each type is:

SK_ID_CURR | credit count | cash count | home count | total count
-|-|-|-|-
1 | 1 | 0 | 3 | 4
2 | 1 | 0 | 0 | 1
3 | 1 | 2 | 0 | 3
4 | 1 | 0 | 2 | 3

We then normalize the counts of each category by the client's total count:

SK_ID_CURR | credit count | cash count | home count | total count | credit count norm | cash count norm | home count norm
-|-|-|-|-|-|-|-
1 | 1 | 0 | 3 | 4 | 0.25 | 0 | 0.75
2 | 1 | 0 | 0 | 1 | 1.00 | 0 | 0
3 | 1 | 2 | 0 | 3 | 0.33 | 0.67 | 0
4 | 1 | 0 | 2 | 3 | 0.33 | 0 | 0.67
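The tables above can be reproduced with a short sketch. It assumes the example rows are loaded into a frame named `loans` (a name chosen here for illustration) and relies on the fact that the mean of a 0/1 indicator column equals the count divided by the group size.

```python
import pandas as pd

# The example table from the text
loans = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4],
    'Loan type':  ['home', 'home', 'home', 'credit', 'credit', 'credit',
                   'cash', 'cash', 'credit', 'home', 'home'],
})

# One indicator column per category, cast to int so sums are plain counts
dummies = pd.get_dummies(loans['Loan type']).astype(int)
dummies['SK_ID_CURR'] = loans['SK_ID_CURR']

grouped = dummies.groupby('SK_ID_CURR')
counts = grouped.sum()   # number of loans of each type per client
norm = grouped.mean()    # share of the client's loans of each type

print(counts)
print(norm.round(2))
```

Computing the normalized count as a mean avoids a separate division by the total count.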

First, we one-hot encode the columns with dtype 'object'.

In [None]:
categorical = pd.get_dummies(bureau.select_dtypes('object'))
categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']
categorical.head()

In [None]:
# Sum (counts) and mean (normalized counts)
categorical_grouped = categorical.groupby('SK_ID_CURR').agg(['sum', 'mean'])
categorical_grouped.head()

Rename the columns:

In [None]:
categorical_grouped.columns.levels[0][:10]

In [None]:
categorical_grouped.columns.levels[1]

In [None]:
group_var = 'SK_ID_CURR'

# Need to create new column names
columns = []

# Iterate through the variables names
for var in categorical_grouped.columns.levels[0]:
    # Skip the grouping variable
    if var != group_var:
        # Iterate through the stat names
        for stat in ['count', 'count_norm']:
            # Make a new column name for the variable and stat
            columns.append('%s_%s' % (var, stat))

#  Rename the columns
categorical_grouped.columns = columns

categorical_grouped.head()

The sum columns record the counts and the mean columns record the normalized counts.

We can merge this dataframe into the training data.

In [None]:
train = train.merge(categorical_grouped, left_on = 'SK_ID_CURR', right_index = True, how = 'left')
train.head()

In [None]:
train.shape

In [None]:
train.iloc[:5, 123:]

## Function to handle categorical variables
We wrap the handling of categorical variables in a function.

In [None]:
def count_categorical(df, group_var, df_name):
    """Computes counts and normalized counts for each observation
    of `group_var` of each unique category in every categorical variable
    
    Parameters
    --------
    df : dataframe 
        The dataframe to calculate the value counts for.
        
    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row
        
    df_name : string
        Variable added to the front of column names to keep track of columns

    
    Return
    --------
    categorical : dataframe
        A dataframe with counts and normalized counts of each unique category in every categorical variable
        with one row for every unique value of the `group_var`.
        
    """
    # Select the categorical columns
    categorical = pd.get_dummies(df.select_dtypes('object'))

    # Make sure to put the identifying id on the column
    categorical[group_var] = df[group_var]

    # Groupby the group var and calculate the sum and mean
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])
    
    column_names = []
    
    # Iterate through the columns in level 0
    for var in categorical.columns.levels[0]:
        # Iterate through the stats in level 1
        for stat in ['count', 'count_norm']:
            # Make a new column name
            column_names.append('%s_%s_%s' % (df_name, var, stat))
    
    categorical.columns = column_names
    
    return categorical

In [None]:
bureau_counts = count_categorical(bureau, group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_counts.head()

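## Processing the other data tables
bureau_balance contains monthly information about each client's previous loans with other financial institutions. Instead of grouping by the client id (SK_ID_CURR), we first group the dataframe by SK_ID_BUREAU, the id of the previous loan. This gives one row per loan. We can then group by SK_ID_CURR and calculate the aggregations across each client's loans. The final result is one row per client, with statistics calculated over their loans.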

In [None]:
# Read in bureau balance
bureau_balance = pd.read_csv('../input/bureau_balance.csv')
bureau_balance.head()

In [None]:
# Counts of each type of status for each previous loan
bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_counts.head()

In [None]:
# Aggregate the numeric variables for each previous loan
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_agg.head()

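Now we need to aggregate the loan information for each client. We first merge the dataframes together; since all the variables are numeric, we only need to aggregate the statistics again, this time grouping by SK_ID_CURR.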

In [None]:
# Dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')

# Merge to include the SK_ID_CURR
bureau_by_loan = bureau_by_loan.merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']], on = 'SK_ID_BUREAU', how = 'left')

bureau_by_loan.head()

In [None]:
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'client')
bureau_balance_by_client.head()

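* client_bureau_balance_MONTHS_BALANCE_mean_mean: for each loan, calculate the mean of MONTHS_BALANCE. Then, for each client, calculate the mean of this value across all of their loans.
* client_bureau_balance_STATUS_X_count_norm_sum: for each loan, calculate the number of occurrences of STATUS == X divided by the total number of STATUS values for the loan. Then, for each client, sum these values across their loans.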
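The two-stage aggregation behind a name such as `..._mean_mean` can be sketched on hypothetical toy data. The frames and the shortened column names below are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical monthly records: loans 1 and 2 belong to client 100, loan 3 to client 200
toy_balance = pd.DataFrame({
    'SK_ID_BUREAU':   [1, 1, 1, 2, 2, 3],
    'MONTHS_BALANCE': [0, -1, -2, 0, -1, 0],
})
toy_bureau = pd.DataFrame({'SK_ID_BUREAU': [1, 2, 3], 'SK_ID_CURR': [100, 100, 200]})

# Stage 1: one row per loan
by_loan = (toy_balance.groupby('SK_ID_BUREAU', as_index=False)
           .agg(MONTHS_BALANCE_mean=('MONTHS_BALANCE', 'mean')))

# Attach the client id to each loan
by_loan = by_loan.merge(toy_bureau, on='SK_ID_BUREAU', how='left')

# Stage 2: one row per client, averaging the per-loan means
by_client = (by_loan.groupby('SK_ID_CURR', as_index=False)
             .agg(MONTHS_BALANCE_mean_mean=('MONTHS_BALANCE_mean', 'mean')))
print(by_client)
```

Note that a mean of per-loan means weights every loan equally: client 100 gets -0.75 here, whereas averaging all five of that client's monthly rows directly would give -0.8.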

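# Running everything again with the functions
Let's reset all the variables and then use the functions we built to do this from the ground up.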

In [None]:
# Free up memory by deleting old objects
import gc
gc.enable()
del train, bureau, bureau_balance, bureau_agg, bureau_agg_new, bureau_balance_agg, bureau_balance_counts, bureau_by_loan, bureau_balance_by_client, bureau_counts
gc.collect()

In [None]:
# Read in new copies of all the dataframes
train = pd.read_csv('../input/application_train.csv')
bureau = pd.read_csv('../input/bureau.csv')
bureau_balance = pd.read_csv('../input/bureau_balance.csv')

### Counts of Bureau Dataframe

In [None]:
bureau_counts = count_categorical(bureau, group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_counts.head()

### Aggregated Stats of Bureau Dataframe

In [None]:
bureau_agg = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_agg.head()

### Value counts of Bureau Balance dataframe by loan

In [None]:
bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_counts.head()

### Aggregated stats of Bureau Balance dataframe by loan

In [None]:
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_agg.head()

### Aggregated Stats of Bureau Balance by Client

In [None]:
# Dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')

# Merge to include the SK_ID_CURR
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, on = 'SK_ID_BUREAU', how = 'left')

# Aggregate the stats for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'client')

# Inserting the computed features into the training data

In [None]:
original_features = list(train.columns)
print('Original Number of Features: ', len(original_features))

In [None]:
# Merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

In [None]:
new_features = list(train.columns)
print('Number of features using previous loans from other institutions data: ', len(new_features))

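# Feature engineering outcomes
After all that work, we now want to look at the variables we have created. We can examine the percentage of missing values, the correlations of the variables with the target, and the correlations of the variables with each other. Correlations between variables show whether we have collinear variables, that is, variables that are highly correlated with one another. Usually, we want to remove one variable from each collinear pair. We can also use the percentage of missing values to remove variables with too many missing values.

Feature selection will be an important focus, because reducing the number of features can help the model learn during training and generalize better to the test data. The "curse of dimensionality" is the name for the problems caused by having too many features (too high a dimension). As the number of variables increases, the number of data points needed to learn the relationship between these variables and the target grows exponentially.

### Missing values
Remove the columns that have too many missing values.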

In [None]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

In [None]:
missing_train = missing_values_table(train)
missing_train.head(10)

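We see there are a number of columns with a high percentage of missing values. There is no well-established threshold for removing missing values, and the best course of action depends on the problem. Here, to reduce the number of features, we will remove any column that has more than 90% missing values in either the training or the testing data.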

In [None]:
missing_train_vars = list(missing_train.index[missing_train['% of Total Values'] > 90])
len(missing_train_vars)
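The threshold rule can be sketched on a hypothetical toy frame. The column names below are invented, and `isnull().mean()` is used as a shortcut for the missing fraction per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 20 rows
toy = pd.DataFrame({
    'mostly_missing': [np.nan] * 19 + [1.0],        # 95% missing
    'half_missing':   [np.nan] * 10 + [1.0] * 10,   # 50% missing
    'complete':       range(20),                    # 0% missing
})

# Percentage of missing values per column
pct_missing = 100 * toy.isnull().mean()

# Columns above the 90% threshold are dropped
to_drop = list(pct_missing.index[pct_missing > 90])
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```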

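Before we remove the columns, we will find the percentage of missing values in the testing data. We will then remove any column with more than 90% missing values in either the training or the testing data. Let's now read in the testing data, perform the same operations, and look at its missing values. We have already calculated all the counts and aggregation statistics, so we only need to merge the testing data with the appropriate dataframes.

# Processing the test data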

In [None]:
# Read in the test dataframe
test = pd.read_csv('../input/application_test.csv')

# Merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the monthly information grouped by client
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

In [None]:
print('Shape of Testing Data: ', test.shape)

In [None]:
train_labels = train['TARGET']

# Align the dataframes, this will remove the 'TARGET' column
train, test = train.align(test, join = 'inner', axis = 1)

train['TARGET'] = train_labels

print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)
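What `align` does with `join = 'inner'` and `axis = 1` can be seen on two hypothetical toy frames: only the columns present in both survive, and the rows are left untouched.

```python
import pandas as pd

# Hypothetical toy frames with partly different columns
toy_train = pd.DataFrame({'SK_ID_CURR': [1, 2], 'feat_a': [0.1, 0.2],
                          'only_in_train': [5, 6], 'TARGET': [0, 1]})
toy_test = pd.DataFrame({'SK_ID_CURR': [3], 'feat_a': [0.3], 'only_in_test': [7]})

# Save the labels, because TARGET is missing from the test frame
toy_labels = toy_train['TARGET']

# Keep only the columns shared by both frames
toy_train, toy_test = toy_train.align(toy_test, join='inner', axis=1)

# TARGET was dropped by the alignment, so add it back to the training frame
toy_train['TARGET'] = toy_labels
print(toy_train.columns.tolist(), toy_test.columns.tolist())
```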

In [None]:
missing_test = missing_values_table(test)
missing_test.head(10)

In [None]:
missing_test_vars = list(missing_test.index[missing_test['% of Total Values'] > 90])
len(missing_test_vars)

In [None]:
missing_columns = list(set(missing_test_vars + missing_train_vars))
print('There are %d columns with more than 90%% missing in either the training or testing data.' % len(missing_columns))

In [None]:
# Drop the missing columns
train = train.drop(columns = missing_columns)
test = test.drop(columns = missing_columns)

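We end up removing no columns in this round, because no column has more than 90% missing values. We may have to apply another feature selection method to reduce the dimensionality.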

In [None]:
train.to_csv('train_bureau_raw.csv', index = False)
test.to_csv('test_bureau_raw.csv', index = False)

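# Correlations
First let's look at the correlations of the variables with the target. We can check whether any of the variables we created has a greater correlation than those already present in the original training data.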

In [None]:
# Calculate all correlations between the numeric columns of the dataframe
corrs = train.select_dtypes('number').corr()

corrs = corrs.sort_values('TARGET', ascending = False)

# Ten most positive correlations
pd.DataFrame(corrs['TARGET'].head(10))

In [None]:
# Ten most negative correlations
pd.DataFrame(corrs['TARGET'].dropna().tail(10))

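The variable most highly correlated with the target is one we created. However, a variable being correlated does not mean it will be useful, and we have to remember that if we generate hundreds of new variables, some will be correlated with the target simply because of random noise.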
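The point about random noise can be illustrated with a small simulation. Everything below is synthetic and assumed for illustration: 300 features of pure noise are correlated with a random binary target that has no relationship to any of them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_features = 200, 300

# Pure noise: neither the features nor the target carry any signal
noise = pd.DataFrame(rng.normal(size=(n_rows, n_features)))
target = pd.Series(rng.integers(0, 2, size=n_rows))

# Absolute correlation of every noise feature with the target
abs_corrs = noise.corrwith(target).abs()
print('Median |r|:  %0.3f' % abs_corrs.median())
print('Largest |r|: %0.3f' % abs_corrs.max())
```

The largest absolute correlation sits well above the typical one even though no feature is informative, which is why a ranking by correlation alone should be treated with caution.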

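It does look like a few of the newly created variables may be useful. To assess the "usefulness" of variables, we will look at the feature importances returned by the model. For now, we can make KDE plots of two of the newly created variables.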

In [None]:
kde_target(var_name='bureau_DAYS_CREDIT_mean', df=train)

In [None]:
kde_target(var_name='bureau_CREDIT_ACTIVE_Active_count_norm', df=train)

### Collinear Variables

In [None]:
# Set the threshold
threshold = 0.8

# Empty dictionary to hold correlated variables
above_threshold_vars = {}

# For each column, record the variables that are above the threshold
for col in corrs:
    above_threshold_vars[col] = list(corrs.index[corrs[col] > threshold])

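For each pair of highly correlated variables, we only want to remove one of the variables. The following code builds the set of variables to remove by adding only one from each pair.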

In [None]:
# Track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

# Iterate through columns and correlated columns
for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        if x == key:
            continue
        else:
            # Only want to remove one in a pair
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)
            
cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
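An alternative way to pick one column from each correlated pair, different from the loop used in this notebook, is to scan only the upper triangle of the absolute correlation matrix. The sketch below uses synthetic toy data; because it takes absolute values, it would also flag strongly negative correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)

# Synthetic toy frame: a_scaled is a linear function of a, b is independent noise
toy = pd.DataFrame({
    'a': a,
    'a_scaled': 2 * a + 1,
    'b': rng.normal(size=100),
})

threshold = 0.8
corr_matrix = toy.corr().abs()

# Keep only the entries above the diagonal so each pair is considered once
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# Drop the later column of every pair whose correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```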

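We can remove these columns from both the training and the testing datasets, and then compare the performance after removing them with the performance when keeping them (the raw csv files we saved earlier).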

In [None]:
train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)

print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)

In [None]:
train_corrs_removed.to_csv('train_bureau_corrs_removed.csv', index = False)
test_corrs_removed.to_csv('test_bureau_corrs_removed.csv', index = False)

# Modeling

* control: only the data in the **application** files.
* test one: the data in the application files with all of the data recorded from the **bureau** and **bureau_balance** files
* test two: the data in the application files with all of the data recorded from the **bureau** and **bureau_balance** files with highly correlated variables removed.

In [None]:
import lightgbm as lgb

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import gc

import matplotlib.pyplot as plt

In [None]:
def model(features, test_features, encoding = 'ohe', n_folds = 5):
    
    """Train and test a light gradient boosting model using
    cross validation. 
    
    Parameters
    --------
        features (pd.DataFrame): 
            dataframe of training features to use 
            for training a model. Must include the TARGET column.
        test_features (pd.DataFrame): 
            dataframe of testing features to use
            for making predictions with the model. 
        encoding (str, default = 'ohe'): 
            method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
        n_folds (int, default = 5): 
            number of folds to use for cross validation
        
    Return
    --------
        submission (pd.DataFrame): 
            dataframe with `SK_ID_CURR` and `TARGET` probabilities
            predicted by the model.
        feature_importances (pd.DataFrame): 
            dataframe with the feature importances from the model.
        valid_metrics (pd.DataFrame): 
            dataframe with training and validation metrics (ROC AUC) for each fold and overall.
        
    """
    
    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    
    # Extract the labels for training
    labels = features['TARGET']
    
    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
    
    
    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)
        
        # No categorical indices to record
        cat_indices = 'auto'
    
    # Integer label encoding
    elif encoding == 'le':
        
        # Create a label encoder
        label_encoder = LabelEncoder()
        
        # List for storing categorical indices
        cat_indices = []
        
        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Map the categorical features to integers
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # Record the categorical indices
                cat_indices.append(i)
    
    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")
        
    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)
    
    # Extract feature names
    feature_names = list(features.columns)
    
    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)
    
    # Create the kfold object (random_state is only accepted together with shuffle = True)
    k_fold = KFold(n_splits = n_folds, shuffle = False)
    
    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    
    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])
    
    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    
    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []
    
    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):
        
        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]
        
        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        
        # Train the model; early stopping and logging are passed as callbacks
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  callbacks = [lgb.early_stopping(stopping_rounds = 100), lgb.log_evaluation(period = 200)])
        
        # Record the best iteration
        best_iteration = model.best_iteration_
        
        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        
        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        
        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        
    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    
    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
    
    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
    
    return submission, feature_importances, metrics

In [None]:
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better. 
    
    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`
        
    Returns:
        shows a plot of the 15 most important features
        
        df (dataframe): feature importances sorted by importance (highest to lowest) 
        with a column for normalized importance
        """
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    return df

### Control
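The first step in any experiment is establishing a control. For this we will use the function defined above (which implements the model) and the single main data source (application).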

In [None]:
train_control = pd.read_csv('../input/application_train.csv')
test_control = pd.read_csv('../input/application_test.csv')

In [None]:
submission, fi, metrics = model(train_control, test_control)

In [None]:
metrics

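The model is overfitting, because the training score is considerably higher than the validation score. We can address this in later notebooks (we have already applied some regularization in this model through reg_lambda and reg_alpha, as well as early stopping).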

In [None]:
fi_sorted = plot_feature_importances(fi)

In [None]:
submission.to_csv('control.csv', index = False)

The control scores **0.745** when submitted to the competition.

### Test One

In [None]:
submission_raw, fi_raw, metrics_raw = model(train, test)

In [None]:
metrics_raw

In [None]:
fi_raw_sorted = plot_feature_importances(fi_raw)

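It looks like some of the features we built are among the most important. Let's find the percentage of the top 100 most important features that we made. Rather than comparing against the original features directly, we need to compare against the one-hot encoded original features. These are already recorded in fi (from the original data).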

In [None]:
top_100 = list(fi_raw_sorted['feature'])[:100]
new_features = [x for x in top_100 if x not in list(fi['feature'])]

print('%% of Top 100 Features created from the bureau data = %d.00' % len(new_features))

Over half of the top 100 features were made by us! That should give us confidence that all the hard work was worthwhile.

In [None]:
submission_raw.to_csv('test_one.csv', index = False)

**Test one scores 0.759 when submitted to the competition.**

### Test Two

In [None]:
submission_corrs, fi_corrs, metrics_corr = model(train_corrs_removed, test_corrs_removed)

In [None]:
metrics_corr

These results are better than the control, but slightly lower than the raw features.

In [None]:
fi_corrs_sorted = plot_feature_importances(fi_corrs)

In [None]:
submission_corrs.to_csv('test_two.csv', index = False)

**Test Two scores 0.753 when submitted to the competition.**

# Results
After all that work, we can say that including the extra information did improve performance! Let's formally summarize the results:

Experiment | Train AUC | Validation AUC | Test AUC
-|-|-|-
Control | 0.815 | 0.760 | 0.745
Test One | 0.837 | 0.767 | 0.759
Test Two | 0.826 | 0.765 | 0.753

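All of our hard work translates into an improvement of 0.014 ROC AUC over the control on the test data. Removing the highly collinear variables slightly decreases performance, so we will want to consider a different method for feature selection. Moreover, we can say that some of the features we built are among the most important as judged by the model.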
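The arithmetic behind this summary can be checked with a short sketch that re-enters the scores from the table above. The derived column names `test_gain` and `train_valid_gap` are chosen here for illustration.

```python
import pandas as pd

# Scores copied from the results table
results = pd.DataFrame({
    'experiment': ['Control', 'Test One', 'Test Two'],
    'train_auc':  [0.815, 0.837, 0.826],
    'valid_auc':  [0.760, 0.767, 0.765],
    'test_auc':   [0.745, 0.759, 0.753],
}).set_index('experiment')

# Improvement on the test data relative to the control
results['test_gain'] = (results['test_auc'] - results.loc['Control', 'test_auc']).round(3)

# Gap between training and validation scores, a rough signal of overfitting
results['train_valid_gap'] = (results['train_auc'] - results['valid_auc']).round(3)
print(results)
```

The gap column shows that the added features raise the test score while also widening the distance between training and validation scores.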

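# Next steps
We can now use the functions we developed in this notebook on the other datasets. There are still 4 other data files to use in our model! In the next notebook, we will incorporate the information from these files (which contain information on previous loans at Home Credit) into our training data. Then we can build the same model and run more experiments to determine the effect of our feature engineering.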