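I have been learning Python for a while, but knowledge from books only goes so far, so I wanted a project to test what I have learned. These are my project notes for Home Credit Default Risk. I chose this project because its tables are fairly complex and close to real-world data. Along the way I also studied many kernels written by experts on Kaggle; standing on the shoulders of giants has taught me a great deal. Thank you.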

## I. Exploring the data (train and test tables)

### 1-1 Import the required modules

In [None]:
# Core modules
import numpy as np
import pandas as pd
# Plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
# train_test_split to split the data before training, roc_auc_score to score the results, xgboost as the base model
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import xgboost
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
# Memory cleanup
import gc
# Show plots inline in the notebook
%matplotlib inline
# List the data files
import os
os.listdir("../input")

### 1-2 Load the train and test tables


In [None]:
df_train = pd.read_csv('../input/application_train.csv')
df_test  = pd.read_csv('../input/application_test.csv')

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_all = pd.concat([df_train.loc[: , 'SK_ID_CURR':'AMT_REQ_CREDIT_BUREAU_YEAR'],
                   df_test.loc[: , 'SK_ID_CURR':'AMT_REQ_CREDIT_BUREAU_YEAR']])
df_all = df_all.reset_index(drop = True)
df_all.drop('TARGET', axis = 1, inplace = True)

In [None]:
print(df_train.shape, df_test.shape, df_all.shape)

### 1-3 Columns with missing values and their missing rates

In [None]:
# Find the columns with missing values and the share of values missing in each
def missing_values_table(df):
    miss_value = df.isnull().sum()
    miss_val_percent = 100 * df.isnull().sum() / len(df)
    miss_table = pd.concat([miss_value,miss_val_percent], axis = 1)
    miss_table = miss_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    miss_table = miss_table[miss_table.iloc[: , 1]!= 0].sort_values('% of Total Values', ascending=False).round(2)
    return miss_table

In [None]:
missing_values_table(df_train).head(10)

In [None]:
missing_values_table(df_test).head(10)

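**The columns with missing values are essentially the same in the train and test tables, and the ten columns with the most missing values are all close to 70% missing.**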

### 1-4 Distribution of the target in the train table

In [None]:
distribution_of_target = df_train['TARGET'].value_counts()
print(distribution_of_target)

In [None]:
sns.countplot(x = 'TARGET',data = df_train)
perc_target = (100 * df_train['TARGET'].sum() / df_train['TARGET'].count()).round(4)
print('percentage of default : %0.2f%%' %  perc_target)

**The default rate among users in the train table is 8.07%.**

## 2. Exploratory data analysis


### 2-1 Age distribution and default

In [None]:
import seaborn as sns

diverging_colors = sns.color_palette("RdBu", 10)
def change_width(ax, new_value) :
    for patch in ax.patches :
        current_width = patch.get_width()
        diff = current_width - new_value

        # we change the bar width
        patch.set_width(new_value)

        # we recenter the bar
        patch.set_x(patch.get_x() + diff * .5)

In [None]:
# Bin applicants by age (copy first so that adding columns does not touch df_train)
age = df_train[['DAYS_BIRTH', 'TARGET']].copy()
age['YEARS_BIRTH'] = age['DAYS_BIRTH'] / -365
age['YEARS_BINNED'] = pd.cut(age['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_groups = age.groupby('YEARS_BINNED').mean()
# Plot
plt.figure(figsize = (15, 8))
# plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'], color = 'cornflowerblue', width = 0.3, alpha = 0.7, edgecolor = 'w')
# plt.title('Default Rate of Age Group',fontsize = 15)
# plt.xlabel('Age Group (Years)',fontsize = 12)
# plt.ylabel('Default Rate (%)',fontsize = 12)
# plt.xticks(rotation = 75)
# plt.show()
diverging_colors = sns.color_palette("RdBu", 10)
ax = sns.barplot(x = age_groups.index.astype(str), y = 100 * age_groups['TARGET'], palette=diverging_colors,dodge=True, edgecolor = 'w')
# ax.set(title='Default Rate of Age Group', xlabel='Age Group (Years)', ylabel='Default Rate (%)')
change_width(ax, .55)
plt.title('Default Rate of Age Group',fontsize = 15)
plt.xlabel('Age Group (Years)',fontsize = 12)
plt.ylabel('Default Rate (%)',fontsize = 12)
plt.xticks(rotation = 75)
plt.show()

As the figure shows, the default rate falls as borrower age increases. The 20-25 age group has the highest default rate, above 12%.

### 2-2 Occupation type and default

In [None]:
# (OCCUPATION_TYPE)
tem = df_train.loc[:, ['OCCUPATION_TYPE', 'TARGET']]
tem.fillna('none', inplace = True)
# tem1: number of defaulters in each occupation
tem1 = tem.groupby('OCCUPATION_TYPE').sum( )
# tem2: total number of applicants in each occupation
tem2 = tem.groupby('OCCUPATION_TYPE').count( )
# tem3: default rate (%) of each occupation
tem3 =  tem1 / tem2 * 100
tem3.reset_index(inplace = True)
tem2.reset_index(inplace = True)
fig = plt.figure(figsize = (15, 8))
# ax1 = fig.add_subplot(111)
#Count
sequential_colors = sns.color_palette("RdPu", 20)
qualitative_colors = sns.color_palette("Set3", 10)

ax1=sns.barplot(x='OCCUPATION_TYPE', y='TARGET', data = tem2, edgecolor = 'w', label = 'Count', palette=sequential_colors)
plt.xlabel('Occupation Type', fontsize = 12)
plt.ylabel('Count', fontsize = 12)
plt.xticks(rotation = 75, fontsize = 12)
# Add a second y-axis on the right
ax2 = ax1.twinx( )
# Show the right-hand y-axis as percentages
import matplotlib.ticker as mtick
fmt = '%0.2f%%'
yticks = mtick.FormatStrFormatter(fmt)
#Default Rate
sns.set()
# sns.kdeplot(tem3['OCCUPATION_TYPE'], tem3['TARGET'], label = 'Default Rate', linewidth = 3, color = 'orange', alpha = 0.7)
ax2.plot(tem3['OCCUPATION_TYPE'], tem3['TARGET'], label = 'Default Rate', linewidth = 3, color = 'orange', alpha = 0.7)
ax2.yaxis.set_major_formatter(yticks)
ax1.set_ylabel('Count', fontsize = 12)
ax2.set_ylabel('Default Rate', fontsize = 12)
legend1 = ax1.legend(loc = (.02,.94), fontsize = 10, shadow = True)
legend2 = ax2.legend(loc = (.02,.9), fontsize = 10, shadow = True)
plt.title('Count & Default Rate of Occupation Type', fontsize = 15) 
plt.show( )

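As the figure shows, applicants with no stated occupation are the largest group, close to 100,000 people, followed by Laborers, Sales staff, Core staff and Managers. This broadly matches the target customers of home-loan products.

The occupation with the highest default rate is Waiters/barmen staff, at close to 17.5%. Occupations with a default rate above 10% include Laborers, Drivers, Cooking staff and Security staff. These occupations generally have relatively low incomes and therefore weaker repayment capacity.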

### 2-3 Length of employment and default

In [None]:
# (DAYS_EMPLOYED)
tem = df_train.loc[:, ['DAYS_EMPLOYED', 'TARGET']]
# DAYS_EMPLOYED = 365243 is an anomalous value; separate it out and analyse it on its own
tem1 = tem[tem['TARGET'] == 1]['DAYS_EMPLOYED']
tem1_norm = tem1[tem1 != 365243] / -365
tem1_a_norm = tem1[tem1 == 365243] / 365
tem0 = tem[tem['TARGET'] == 0]['DAYS_EMPLOYED']
tem0_norm = tem0[tem0 != 365243] / -365
tem0_a_norm = tem0[tem0 == 365243] / 365
# Normal values
sns.set_style("ticks")
fig, ax = plt.subplots(2, 1, figsize = (15, 12))
sns.distplot(tem1_norm.values, bins = 100, label = '1',kde=True, color = 'salmon',ax=ax[0])
# ax[0].hist(tem1_norm.values, bins = 100, label = '1',density = True, alpha = 0.8, color = 'salmon', edgecolor = 'w')
sns.distplot(tem0_norm.values, bins = 100, label = '0',kde=True, color = 'cornflowerblue', ax=ax[0])
sns.despine()
ax[0].set_title('Distribution  of  Years_Employed (Normal)', fontsize = 15)
ax[0].set_xlabel('Years_Employed', fontsize = 12)
ax[0].set_ylabel('Probability', fontsize = 12)
ax[0].legend(fontsize = 'large')
# change_width(ax[0], .75)
# Abnormal values (DAYS_EMPLOYED = 365243)
# sns.distplot(tem1_a_norm.values, bins = 3, label = '1',kde=True, color = 'salmon',ax=ax[1])
# sns.distplot(tem0_a_norm.values, bins = 3, label = '0',kde=True, color = 'cornflowerblue', ax=ax[1])
ax[1].hist(tem1_a_norm.values, bins = 3, label = '1',density = False, alpha = 0.8, color = 'salmon')
ax[1].hist(tem0_a_norm.values, bins = 3, label = '0',density = False, alpha = 0.7, color = 'cornflowerblue')

ax[1].set_title('Distribution  of Years_Employed (Abnormal)', fontsize = 15)
ax[1].set_xlabel('YEARS_EMPLOYED', fontsize = 12)
ax[1].set_ylabel('Count', fontsize = 12)
ax[1].legend(fontsize = 'large')
plt.show()

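As the figure shows, the shorter the length of employment, the higher the default rate. Borrowers with short working histories (0-5 years) are the most likely to default. As the years of employment increase and savings build up and stabilise, the default rate falls markedly.

Analysing the anomalous value (DAYS_EMPLOYED = 365243) separately shows that borrowers carrying it default with low probability. It is a special marker with good discriminating power and should not be deleted lightly.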

### 2-4 Annual income and default

In [None]:
tem = df_train.loc[:, ['AMT_INCOME_TOTAL','TARGET']]
tem.describe()

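The maximum of total income lies far from the bulk of the data and would distort the plot, so the analysis below keeps only the values up to the 98th percentile.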

In [None]:
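# tem1: annual total income of defaulting customers; tem0: annual total income of non-defaulting customers
tem1 = tem[tem['TARGET'] == 1]['AMT_INCOME_TOTAL']
tem0 = tem[tem['TARGET'] == 0]['AMT_INCOME_TOTAL']
# Keep values up to the 98th percentile within each group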
tem1_98 = tem1[tem1<= np.percentile(tem1, 98)]
tem0_98 = tem0[tem0<= np.percentile(tem0, 98)]
sns.set_style("ticks")
fig, ax = plt.subplots(figsize = (15, 8), sharex = True)
sns.distplot(tem1_98.values, bins = 50, label = '1', kde=True, color = 'gold', ax=ax)
sns.distplot(tem0_98.values, bins = 50, label = '0', kde=True, color = 'mediumpurple', ax=ax)
sns.despine()
change_width(ax, 5000)
plt.title('Distribution  of  Amt_Income_Total', fontsize = 15)
plt.xlabel('AMT_INCOME_TOTAL', fontsize = 12)
plt.ylabel('Probability', fontsize = 12)
plt.legend(fontsize = 'large')
plt.show()

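As the figure shows, the share of defaults gradually falls as annual income rises; the default rate is relatively high in the 100,000-200,000 annual income range.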

### 2-5 Debt level and default

In [None]:
tem = df_train.loc[:, ['AMT_CREDIT', 'AMT_INCOME_TOTAL', 'TARGET'] ]
# Ratio of the loan amount to total income
tem['CreditToIncomeRatio'] = df_train['AMT_CREDIT'] / df_train['AMT_INCOME_TOTAL']
tem.describe()

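The maximum of the credit-to-income ratio lies far from the bulk of the data and would distort the plot, so the analysis below keeps only the values up to the 98th percentile.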

In [None]:
tem1 = tem[tem['TARGET'] == 1]['CreditToIncomeRatio']
tem0 = tem[tem['TARGET'] == 0]['CreditToIncomeRatio']
# Keep values of the credit-to-income ratio up to the 98th percentile within each group
tem1_98 = tem1[tem1 <= np.percentile(tem1, 98)]
tem0_98 = tem0[tem0 <= np.percentile(tem0, 98)]
fig, ax = plt.subplots(figsize =(15, 8), sharex = True)
# plt.hist(tem1_98.values, bins = 90, label = '1', density = True, alpha = 0.7, color = 'salmon', edgecolor = 'w')
# plt.hist(tem0_98.values, bins = 90, label = '0', density = True, alpha = 0.6, color = 'cornflowerblue', edgecolor = 'w')
sns.distplot(tem1_98.values, bins = 90, label = '1', kde=True, color = 'salmon')
sns.distplot(tem0_98.values, bins = 90, label = '0', kde=True, color = 'cornflowerblue')
sns.despine()
plt.title('Distribution  of  Credit_To_Income_Ratio', fontsize = 15)
plt.xlabel('Credit To Income Ratio', fontsize = 12)
plt.ylabel('Probability', fontsize = 12)
plt.legend(fontsize = 'large')
plt.show()

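Common sense suggests that people with a low credit-to-income ratio are better able to repay. The figure, however, shows no clear, uniform relationship between the ratio and default. Taking a ratio of 2 as the dividing line, borrowers with a ratio below 2 have a relatively low default rate; between 2 and 6 the default rate is relatively high; above 6 it falls back again.

Borrowers whose loan amount is far larger than their annual income have a relatively low default rate, which suggests that this group has a higher credit standing and lower default risk and can therefore borrow more.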
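
The density plots compare the two distributions but do not give a default rate per band directly. The following is a minimal sketch, not part of the original notebook, that computes the rate per band. The helper name `default_rate_by_bin`, the band edges 0, 2 and 6 (taken from the discussion above) and the toy values are all assumptions for illustration; on the real data one would pass `tem['CreditToIncomeRatio']` and `tem['TARGET']`.

```python
import numpy as np
import pandas as pd

def default_rate_by_bin(values, target, bins):
    """Hypothetical helper: count and default rate (%) of `target` per bin of `values`."""
    binned = pd.cut(values, bins=bins)
    frame = pd.DataFrame({'bin': binned, 'target': target})
    out = frame.groupby('bin', observed=False)['target'].agg(['count', 'mean'])
    out['default_rate_pct'] = 100 * out['mean']
    return out.drop(columns='mean')

# Toy data standing in for df_train (made-up values, for illustration only)
toy = pd.DataFrame({
    'AMT_CREDIT':       [100, 150, 300, 500, 700, 900],
    'AMT_INCOME_TOTAL': [100, 100, 100, 100, 100, 100],
    'TARGET':           [0,   0,   1,   0,   1,   0],
})
ratio = toy['AMT_CREDIT'] / toy['AMT_INCOME_TOTAL']

# Bands: (0, 2], (2, 6], (6, inf]
rates = default_rate_by_bin(ratio, toy['TARGET'], bins=[0, 2, 6, np.inf])
print(rates)
```

Reporting the count next to the rate makes it easier to judge whether a band has enough borrowers for its rate to be trusted.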

## 3. Risk and business-related indicators
### 3-1 Number of credit bureau enquiries about the borrower and default

In [None]:
# Total number of enquiries per borrower; borrowers with no enquiry record are marked -1
e = df_train.loc[:, [ 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
                     'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'TARGET'] ]
e['AMT_REQ_CREDIT_BUREAU_TOTAL'] = e['AMT_REQ_CREDIT_BUREAU_HOUR'] + e['AMT_REQ_CREDIT_BUREAU_DAY'] + \
    e['AMT_REQ_CREDIT_BUREAU_WEEK'] + e['AMT_REQ_CREDIT_BUREAU_MON'] + e['AMT_REQ_CREDIT_BUREAU_QRT'] + e['AMT_REQ_CREDIT_BUREAU_YEAR']
e = e.loc[:, ['AMT_REQ_CREDIT_BUREAU_TOTAL', 'TARGET'] ]
e.fillna(-1, inplace = True)
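# e1: number of defaulters for each enquiry count
e1 = e.groupby('AMT_REQ_CREDIT_BUREAU_TOTAL').sum()
# e2: total number of applicants for each enquiry count
e2 = e.groupby('AMT_REQ_CREDIT_BUREAU_TOTAL').count()
# e3: default rate (%) for each enquiry count
e3 =  e1 / e2 * 100
e3.reset_index(inplace = True)
e2.reset_index(inplace = True)
# Remove extreme enquiry counts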
e2 = e2[e2['AMT_REQ_CREDIT_BUREAU_TOTAL'] < 100]
e3 = e3[e3['AMT_REQ_CREDIT_BUREAU_TOTAL'] < 100]
fig = plt.figure(figsize = (15, 8))
ax1 = fig.add_subplot(111)
# Count
sns.barplot(x = 'AMT_REQ_CREDIT_BUREAU_TOTAL', y = 'TARGET', data = e2, label = 'Count', palette = qualitative_colors,ax = ax1)
# ax1.bar('AMT_REQ_CREDIT_BUREAU_TOTAL', 'TARGET', data = e2, width = 0.6, label = 'Count', edgecolor = 'w', color = 'violet', alpha = 0.7)
plt.xlabel('Amt_Req_Credit_Bureau_Total', fontsize = 12)
plt.ylabel('Count', fontsize = 12)
# Add a second y-axis on the right
ax2 = ax1.twinx( )
# Show the right-hand y-axis as percentages
import matplotlib.ticker as mtick
fmt = '%0.2f%%'
yticks = mtick.FormatStrFormatter(fmt)
# Default rate (drawn explicitly on the right-hand axis)

sns.lineplot(data = e3, x='AMT_REQ_CREDIT_BUREAU_TOTAL' , y='TARGET' , label = 'Default Rate', linewidth=3, ax = ax2)
# ax2.plot(e3['AMT_REQ_CREDIT_BUREAU_TOTAL'], e3['TARGET'], label = 'Default Rate', linewidth = 3, color = 'mediumslateblue')
ax2.yaxis.set_major_formatter(yticks)
ax1.set_ylabel('Count', fontsize = 12)
ax2.set_ylabel('Default Rate', fontsize = 12)
legend1=ax1.legend(loc=(0.8, .9), fontsize = 10, shadow = True)
legend2=ax2.legend(loc=(0.8, .85), fontsize = 10, shadow = True)
plt.title('Count & Default Rate  of  Amt_Req_Credit_Bureau_Total', fontsize = 15) 
plt.show( )

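As the figure shows, the distribution of enquiry counts is right-skewed overall. Most applicants fall between -1 (no enquiry record) and 6, and each of these values covers more than 10,000 applicants. For enquiry counts between 6 and 30 there are very few applicants and the default rate fluctuates sharply, so it is not reliable and is not analysed. The default rate first falls and then rises gently: applicants with no enquiry record (-1) are numerous and their default rate is above 10%, while between 0 and 6 enquiries the rate climbs slowly at around 8%.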
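
One detail of the total above is how missing values are handled: adding columns with `+` yields NaN whenever any one component is missing, so the -1 bucket holds every borrower with at least one missing enquiry column. The sketch below, which uses toy data and only two of the six columns for brevity, contrasts this with `DataFrame.sum(axis=1, min_count=1)`, which is NaN only when all components are missing. It is an illustration of the difference, not a claim about which definition the original analysis intended.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values); only two of the six enquiry columns are used for brevity
req_cols = ['AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_YEAR']
toy = pd.DataFrame({
    'AMT_REQ_CREDIT_BUREAU_HOUR': [0.0, np.nan, np.nan],
    'AMT_REQ_CREDIT_BUREAU_YEAR': [2.0, 3.0, np.nan],
})

# `+` propagates NaN: one missing component makes the whole total missing
total_plus = toy[req_cols[0]] + toy[req_cols[1]]

# sum(min_count=1) skips NaN and is NaN only when every component is missing
total_sum = toy[req_cols].sum(axis=1, min_count=1)

# Mark "no record" as -1 in both variants
total_plus_filled = total_plus.fillna(-1)
total_sum_filled = total_sum.fillna(-1)
print(pd.DataFrame({'plus': total_plus_filled, 'sum_min_count': total_sum_filled}))
```

If the six columns are always missing together for a borrower, the two variants give the same result.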

In [None]:
del age_groups, e1
gc.collect()

### 3-2 Number of previous loans and default

In [None]:
# Load the previous applications table
pa = pd.read_csv('../input/previous_application.csv')

In [None]:
tem = pa.loc[:, ['SK_ID_CURR', 'SK_ID_PREV']]
tem = tem.groupby(['SK_ID_CURR']).count().reset_index()
tem = df_train.loc[:, ['SK_ID_CURR','TARGET']].merge(tem, how = 'left', on = 'SK_ID_CURR')
tem = tem.loc[:, ['SK_ID_PREV','TARGET']]
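# tem1: total number of applicants for each number of previous loans
tem1 = tem.groupby('SK_ID_PREV').count()
# tem2: number of defaulters for each number of previous loans
tem2 = tem.groupby('SK_ID_PREV').sum()
# tem3: default rate (%) for each number of previous loans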
tem3 = tem2 / tem1 * 100
tem3.reset_index(inplace = True)
tem1.reset_index(inplace = True)
fig = plt.figure(figsize = (15, 8))
ax1 = fig.add_subplot(111)
# Few applicants have more than 20 previous loans and their default rate fluctuates strongly, so only counts up to 20 are plotted
sns.barplot(x = 'SK_ID_PREV', y = 'TARGET',  data = tem1[tem1['SK_ID_PREV'] < 21], label = 'Count', palette=qualitative_colors, ax=ax1)
sns.despine()
plt.xlabel('Times Of Previous Loans (History)', fontsize = 12)

plt.ylabel('Count', fontsize = 12)
# Add a second y-axis on the right
ax2 = ax1.twinx( )
# Show the right-hand y-axis as percentages
import matplotlib.ticker as mtick
fmt = '%0.2f%%'
yticks = mtick.FormatStrFormatter(fmt)
#Default Rate
sns.lineplot(x='SK_ID_PREV', y='TARGET', data=tem3[tem3['SK_ID_PREV'] < 21] ,label = 'Default Rate', linewidth = 3, color = 'cornflowerblue',ax=ax2)
ax2.yaxis.set_major_formatter(yticks)
ax2.set_ylabel('Default Rate', fontsize = 12)
legend1=ax1.legend(loc=(0.75, .94), fontsize = 10, shadow = True)
legend2=ax2.legend(loc=(0.75, .9), fontsize = 10, shadow = True)
plt.title('Count & Default Rate  of History Previous Loans Times', fontsize = 15) 
plt.show( )

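As the figure shows, the number of borrowers falls as the number of previous loans rises. Within the range of 0-10 previous loans, the default rate first falls and then rises again; borrowers with 4 previous loans have the lowest default rate, at 7.56%.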

In [None]:
del  tem, tem1, tem2, pa
gc.collect()

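### 3-3 Total historical days overdue and default
Load the instalment payments table (installments_payments). Because the table can contain several payments for the same instalment, it is first tidied as follows: payment amounts for the same instalment are summed, and the latest payment date of the instalment is kept.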
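
The tidy-up described above can be sketched on a toy table. This is an illustrative variant, not the notebook's own code: it uses pandas named aggregation, which returns flat column names and so avoids flattening a MultiIndex afterwards. The toy values are made up; the column names follow installments_payments.

```python
import pandas as pd

# Toy table: instalment 1 of one loan was paid in two parts, instalment 2 in one
toy_ip = pd.DataFrame({
    'SK_ID_CURR':            [1, 1, 1],
    'SK_ID_PREV':            [10, 10, 10],
    'NUM_INSTALMENT_NUMBER': [1, 1, 2],
    'DAYS_INSTALMENT':       [-60.0, -60.0, -30.0],
    'DAYS_ENTRY_PAYMENT':    [-62.0, -55.0, -31.0],
    'AMT_INSTALMENT':        [100.0, 100.0, 100.0],
    'AMT_PAYMENT':           [40.0, 60.0, 100.0],
})

keys = ['SK_ID_CURR', 'SK_ID_PREV', 'NUM_INSTALMENT_NUMBER']
toy_agg = toy_ip.groupby(keys, as_index=False).agg(
    DAYS_INSTALMENT=('DAYS_INSTALMENT', 'mean'),       # due date of the instalment
    DAYS_ENTRY_PAYMENT=('DAYS_ENTRY_PAYMENT', 'max'),  # latest payment date
    AMT_INSTALMENT=('AMT_INSTALMENT', 'mean'),         # amount due
    AMT_PAYMENT=('AMT_PAYMENT', 'sum'),                # total amount paid
)

# Positive DEFAULT_DAY means the last payment arrived after the due date
toy_agg['DEFAULT_DAY'] = toy_agg['DAYS_ENTRY_PAYMENT'] - toy_agg['DAYS_INSTALMENT']
# Positive DEFAULT_AMT means the instalment was underpaid
toy_agg['DEFAULT_AMT'] = toy_agg['AMT_INSTALMENT'] - toy_agg['AMT_PAYMENT']
print(toy_agg)
```

In this toy case instalment 1 is fully paid but 5 days late, because the second part arrived after the due date, while instalment 2 is paid one day early.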

In [None]:
ip = pd.read_csv('../input/installments_payments.csv')

In [None]:
# ip = pd.read_csv('../input/installments_payments.csv')
ip_agg = ip.groupby(['SK_ID_CURR', 'SK_ID_PREV', 'NUM_INSTALMENT_NUMBER'], as_index = False).agg({'DAYS_INSTALMENT' : ['mean'],
                                                         'DAYS_ENTRY_PAYMENT' : ['max'],
                                                         'AMT_INSTALMENT' : ['mean'], 
                                                         'AMT_PAYMENT' : 'sum'})
tem = [ ] 
for i in ip_agg.columns.tolist():
    tem.append(i[0])
ip_agg.columns = pd.Index(tem)
# New column DEFAULT_DAY: days overdue for each instalment
ip_agg['DEFAULT_DAY'] = ip_agg['DAYS_ENTRY_PAYMENT'] - ip_agg['DAYS_INSTALMENT']
# New column DEFAULT_AMT: overdue amount for each instalment
ip_agg['DEFAULT_AMT'] = ip_agg['AMT_INSTALMENT'] - ip_agg['AMT_PAYMENT']
# Set days overdue to 0 for instalments that are not yet due and not yet paid
ip_agg.loc[(ip_agg['DAYS_INSTALMENT'] > -30) & (ip_agg['DAYS_ENTRY_PAYMENT'].isnull()), 'DEFAULT_DAY'] = 0
# Set any remaining missing values of days overdue to 0
ip_agg['DEFAULT_DAY'] = ip_agg['DEFAULT_DAY'].fillna(0)
# Records with more than 0 days overdue
tem = ip_agg[ip_agg['DEFAULT_DAY'] > 0]
# Among those records, compute the total and the mean days overdue
default_days_agg = tem.groupby('SK_ID_CURR', as_index = False).agg({'DEFAULT_DAY' : ['sum', 'mean']})
tem = [ ] 
tem.append('SK_ID_CURR')
for i in default_days_agg.columns.tolist():
     if i[0] != 'SK_ID_CURR':
        tem.append(i[0] + '_' + i[1])
default_days_agg.columns = pd.Index(tem)
tem = df_train.loc[: , ['SK_ID_CURR','TARGET']].merge(default_days_agg, how = 'left', on = 'SK_ID_CURR')
# Users who repaid on time or have no history
tem_nan = tem[tem['DEFAULT_DAY_sum'].isnull()]
tem_nan = tem_nan['TARGET'].value_counts().tolist()

# Users with historical overdue payments
tem1 = tem[~tem['DEFAULT_DAY_sum'].isnull()]
g1 = tem1[tem1['TARGET'] == 1]['DEFAULT_DAY_sum']
g0 = tem1[tem1['TARGET'] == 0]['DEFAULT_DAY_sum']
# Among users with historical overdue payments, keep values up to the 80th percentile within each target group
g1_80 = g1[g1<= np.percentile(g1, 80)]
g0_80 = g0[g0<= np.percentile(g0, 80)]

In [None]:
fig, ax = plt.subplots(1, 2, figsize =(18, 8))
# Users with historical overdue payments
sns.distplot(g1_80.values, bins = 50, label = '1', ax=ax[0])
sns.distplot(g0_80.values, bins = 50, label = '0', ax=ax[0])
sns.despine()
# ax[0].hist(g1_80.values, bins = 50, label = '1', density = True, alpha = 0.8, color = 'gold', edgecolor = 'w')
# ax[0].hist(g0_80.values, bins = 50, label = '0', density = True, alpha = 0.6, color = 'cornflowerblue', edgecolor = 'w')
ax[0].set_title('History Repayment Record Is Not Null', fontsize = 10)
ax[0].set_xlabel('Default_Day_Sum', fontsize = 12)
ax[0].set_ylabel('Probability', fontsize = 12)
ax[0].legend(fontsize = 'small')
change_width(ax[0], 1.2)
explode = [0.2, 0]
# Users who repaid on time or have no history
ax[1].pie(tem_nan, labels = ['0','1'],explode=explode, autopct= '%1.1f%%', startangle = 90 ,colors =   ['lightsteelblue', 'navajowhite'], shadow = False )
ax[1].set_title('Normal Repayment or History Repayment Record Is Null ', fontsize = 10)
fig.suptitle('Distribution  of  Default_Day_Sum', fontsize = 15)
plt.show()

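In the figure above, the left panel shows the distribution of total historical days overdue, split by current default status, for borrowers with past overdue payments. When the total is small (roughly 1-10 days) the default rate is relatively low; in the middle range (11-60 days) it fluctuates considerably but is flat overall.

The right panel shows the current default split for borrowers who repaid on time or have no history. Their current default rate is low, at 6.7%, which suggests that borrowers who paid attention to repayment dates in the past are unlikely to fall behind on later loans.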

### 3-4 Mean historical overdue amount and default

In [None]:
# Records with an overdue amount greater than 1
tem = ip_agg[ip_agg['DEFAULT_AMT'] > 1]
# Total and mean overdue amount over the overdue instalments
default_amount_agg = tem.groupby('SK_ID_CURR', as_index = False).agg({'DEFAULT_AMT':['sum', 'mean']})
tem = [ ] 
tem.append('SK_ID_CURR')
for i in default_amount_agg.columns.tolist():
     if i[0] != 'SK_ID_CURR':
        tem.append(i[0] + '_' + i[1])
default_amount_agg.columns = pd.Index(tem)
h = df_train.loc[:, ['SK_ID_CURR','TARGET']].merge(default_amount_agg, how = 'left', on = 'SK_ID_CURR')
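# Users with no history or who repaid in full
h_nan = h[h['DEFAULT_AMT_mean'].isnull()]
h_nan = h_nan['TARGET'].value_counts().tolist()
# Users with historical overdue amounts
hh = h[~h['DEFAULT_AMT_mean'].isnull()]
# Among users with historical overdue amounts, keep values up to the 80th percentile within each target group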
h1 = hh[hh['TARGET'] == 1]['DEFAULT_AMT_mean']
h0 = hh[hh['TARGET'] == 0]['DEFAULT_AMT_mean']
h1_80 = h1[h1<= np.percentile(h1, 80)]
h0_80 = h0[h0<= np.percentile(h0, 80)]
hh.describe()

In [None]:
fig, ax = plt.subplots(1, 2, figsize =(15, 8))
# Users with historical overdue amounts
sns.distplot(h1_80.values, bins = 30, fit=None, label = '1', color = 'salmon',ax=ax[0])
sns.distplot(h0_80.values, bins = 30,fit=None, label = '0', color = 'cornflowerblue', ax=ax[0])
sns.despine()
ax[0].set_title('History Repayment Record Is Not Null', fontsize = 10)
ax[0].set_xlabel('Default_Amt_Mean', fontsize = 12)
ax[0].set_ylabel('Probability', fontsize = 12)
ax[0].legend(fontsize = 'small')

explode = [0.2, 0]
# Users who repaid in full or have no history
ax[1].pie(h_nan, labels = ['0','1'], explode=explode, autopct = '%1.1f%%', startangle = 90 ,colors =  ['lightsteelblue', 'lightsalmon'], shadow = False )
ax[1].set_title('Normal Repayment or History Repayment Record Is Null', fontsize = 10)
fig.suptitle('Distribution  of  Default_Amt_Mean', fontsize = 15)
plt.show()

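In the figure above, the left panel shows the distribution of the mean historical overdue amount, split by current default status, for borrowers with past overdue amounts. When the mean overdue amount is small the default rate is relatively low, and the share of defaults grows as the amount rises.

The right panel shows the current default split for borrowers who repaid in full or have no history. Their current default rate is low, at 8%, which again suggests that borrowers who paid in full in the past are unlikely to fall behind on later loans.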

In [None]:
del  ip_agg, default_days_agg, default_amount_agg, tem, h, ip
gc.collect()


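### 3-5 Distribution of overdue periods (Mn)
Performing assets are denoted C, and Mn denotes n periods overdue, with one M defined as 30 days: M1 is one period overdue, M2 two, M3 three, M4 four, M5 five and M6 six. Mn+ denotes n or more periods overdue, so M7+ means the number of overdue periods is >= M7.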
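
As a worked illustration of the bucketing, the sketch below applies `pd.cut` with the bin edges and labels used in this section to a few made-up values of days past due. Intervals are closed on the right, so 30 days still falls into M1 and 31 days into M2. With these particular edges the label 'M7' covers 181-210 days and 'M7+' covers anything beyond 210 days. The variable names `dpd_bins`, `dpd_labels` and `days_past_due` are chosen for this example only.

```python
import pandas as pd

# Bin edges and labels as used in this section (right-closed intervals)
dpd_bins = [-1, 0, 30, 60, 90, 120, 150, 180, 210, 100000]
dpd_labels = ['C', 'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M7+']

# Made-up days past due, chosen to sit on both sides of several bin edges
days_past_due = pd.Series([0, 1, 30, 31, 180, 181, 210, 211])
buckets = pd.cut(days_past_due, bins=dpd_bins, labels=dpd_labels)

print(pd.DataFrame({'days': days_past_due, 'bucket': buckets}))
```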

In [None]:
pos = pd.read_csv('../input/POS_CASH_balance.csv')

In [None]:

pos['DPD_DAY'] = (pos['SK_DPD'] - pos['SK_DPD_DEF'])
dpd_bins = [-1, 0, 30, 60, 90, 120, 150, 180, 210, 100000]
dpd_labels = ['C', 'M1', 'M2', 'M3', 'M4', 'M5', 'M6','M7', 'M7+']
pos['Default_BINNED'] = pd.cut(pos['DPD_DAY'], bins = dpd_bins, labels = dpd_labels)
i = pos.loc[:, ['SK_ID_CURR', 'SK_ID_PREV', 'Default_BINNED']]
i = pd.get_dummies(i, columns = ['Default_BINNED'])

cat_columns = i.columns.drop(['SK_ID_CURR', 'SK_ID_PREV'])
cat_agg = { }
for a in cat_columns:
    cat_agg[a] = ['mean', 'sum']   

i_agg = i.groupby(['SK_ID_CURR', 'SK_ID_PREV'], as_index = False).agg({**cat_agg})

tem = ['SK_ID_CURR', 'SK_ID_PREV'] 
for i in i_agg.columns.tolist():
     if i[0] != 'SK_ID_CURR' and  i[0] != 'SK_ID_PREV':
        tem.append(i[0] + '_' + i[1])
i_agg.columns = pd.Index(tem)
# For each borrower, compute the total count of each historical overdue status (C, M1, M2, ...) and the share of each status
cat_columns = i_agg.columns.drop(['SK_ID_CURR', 'SK_ID_PREV'])
cat_agg = { }
for a in cat_columns:
    cat_agg[a] = ['mean', 'sum']

i_fin = i_agg.groupby(['SK_ID_CURR'], as_index = False).agg({**cat_agg})

tem = ['SK_ID_CURR'] 
for i in i_fin.columns.tolist():
     if i[0] != 'SK_ID_CURR':
        tem.append(i[0] + '_' + i[1])
i_fin.columns = pd.Index(tem)
# Drop the columns that are not needed
i_fin.drop(['Default_BINNED_C_sum_mean', 'Default_BINNED_C_mean_sum',
               'Default_BINNED_M1_sum_mean', 'Default_BINNED_M1_mean_sum',
               'Default_BINNED_M2_sum_mean', 'Default_BINNED_M2_mean_sum',
               'Default_BINNED_M3_sum_mean', 'Default_BINNED_M3_mean_sum',
               'Default_BINNED_M4_sum_mean', 'Default_BINNED_M4_mean_sum',
               'Default_BINNED_M5_sum_mean', 'Default_BINNED_M5_mean_sum',
               'Default_BINNED_M6_sum_mean', 'Default_BINNED_M6_mean_sum',
               'Default_BINNED_M7_sum_mean', 'Default_BINNED_M7_mean_sum',
               'Default_BINNED_M7+_sum_mean', 'Default_BINNED_M7+_mean_sum'], axis = 1 , inplace = True)
del  pos
gc.collect()

i = df_train.loc[:, ['SK_ID_CURR','TARGET']].merge(i_fin, how = 'left', on = 'SK_ID_CURR')
ii = i[~i['Default_BINNED_M3_sum_sum'].isnull()]
# Correlation coefficient of each column with TARGET
ii.corr().loc[:, 'TARGET'].sort_values()
# Default_BINNED_C_sum_sum       -0.037084
# Default_BINNED_C_mean_mean     -0.011464
# SK_ID_CURR                     -0.002137
# Default_BINNED_M6_sum_sum       0.000428
# Default_BINNED_M4_sum_sum       0.001159
# Default_BINNED_M5_sum_sum       0.001200
# Default_BINNED_M6_mean_mean     0.001674
# Default_BINNED_M4_mean_mean     0.002035
# Default_BINNED_M3_sum_sum       0.002428
# Default_BINNED_M1_sum_sum       0.002449
# Default_BINNED_M2_sum_sum       0.002501
# Default_BINNED_M5_mean_mean     0.002583
# Default_BINNED_M7_sum_sum       0.002700
# Default_BINNED_M3_mean_mean     0.003577
# Default_BINNED_M7_mean_mean     0.003639
# Default_BINNED_M7+_sum_sum      0.003640
# Default_BINNED_M2_mean_mean     0.004266
# Default_BINNED_M7+_mean_mean    0.006016
# Default_BINNED_M1_mean_mean     0.011746
# TARGET                          1.000000
# Name: TARGET, dtype: float64

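1. Plot the distribution of the total count of each historical overdue status (C, M1, M2, ...) per borrower. This note shows C, which has the strongest correlation.

2. Plot the distribution of the share of each historical overdue status (C, M1, M2, ...) per borrower. This note shows M1, which has the strongest correlation.

### 3-5-1 Distribution of the total count of historical status C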

In [None]:
i1 = ii.loc[(ii['TARGET'] == 1) & (ii['Default_BINNED_C_sum_sum'] != 0), 'Default_BINNED_C_sum_sum']
i0 = ii.loc[(ii['TARGET'] == 0) & (ii['Default_BINNED_C_sum_sum'] != 0), 'Default_BINNED_C_sum_sum']
# Many borrowers have a C count of 0, so they are plotted separately
i_pie = ii[ii['Default_BINNED_C_sum_sum'] == 0]['TARGET'].value_counts().tolist()
fig, ax = plt.subplots(1, 2, figsize =(15, 8))
# Distribution where the historical count of status C is greater than 0
sns.distplot(i1.values, bins = 80, label = '1', color = 'gold',ax=ax[0])
sns.distplot(i0.values, bins = 80, label = '0', color = 'cornflowerblue',ax=ax[0])
sns.despine()
ax[0].set_xlabel('Default_Binned_C', fontsize = 12)
ax[0].set_ylabel('Probability', fontsize = 12)
ax[0].legend(fontsize = 'large')
ax[0].set_title(' History Default_Binned_C is not 0', fontsize = 10)
# Distribution where the historical count of status C equals 0

explode = [0.2, 0]
ax[1].pie(i_pie, labels =['0','1'], autopct = '%1.1f%%', explode = explode, startangle = 90 ,colors = ['lightsteelblue', 'navajowhite'], shadow = False )
ax[1].set_title('History Default count is 0', fontsize = 10)
fig.suptitle('Distribution of Default_Binned_C (sum)', fontsize = 15)
plt.show()

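In the figure above, the left panel shows the distribution where the total count of historical status C is greater than 0. In the lower range (0-50 occurrences) the default rate is relatively high, and in the upper range (more than 50 occurrences) it is relatively low, which suggests that users who were in status C many times are less likely to default.

The right panel shows the 0/1 distribution of TARGET when the total count of historical status C is 0; the current default rate is 16.4%.

### 3-5-2 Distribution of the share of historical status M1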

In [None]:
i1 = ii.loc[(ii['TARGET'] == 1) & (ii['Default_BINNED_M1_mean_mean'] != 0), 'Default_BINNED_M1_mean_mean']
i0 = ii.loc[(ii['TARGET'] == 0) & (ii['Default_BINNED_M1_mean_mean'] != 0), 'Default_BINNED_M1_mean_mean']
i_pie = ii[ii['Default_BINNED_M1_mean_mean']== 0]['TARGET'].value_counts().tolist()

fig, ax = plt.subplots(1, 2, figsize =(15, 8))
# Distribution where the historical share of status M1 is greater than 0
sns.distplot(i1.values, bins = 80, label = '1', color = 'salmon', ax=ax[0])
sns.distplot(i0.values, bins = 80, label = '0', color = 'cornflowerblue', ax=ax[0])
sns.despine()
ax[0].set_xlabel('Default_BINNED_M1', fontsize = 12)
ax[0].set_ylabel('Probability', fontsize = 12)
ax[0].legend(fontsize = 'large')
ax[0].set_title(' History Default_Binned_M1 is not 0', fontsize = 10)
# Distribution where the historical share of status M1 equals 0
explode = [0.2, 0]
ax[1].pie(i_pie, labels = ['0','1'], explode = explode, autopct = '%1.1f%%', startangle = 90 ,colors = ['lightsteelblue', 'lightsalmon'], shadow = False )
ax[1].set_title('History Default_Binned_M1 is  0', fontsize = 10)
fig.suptitle('Distribution of Default_Binned_M1 (mean)', fontsize = 15)
plt.show()

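In the figure above, the left panel shows the distribution where the share of historical status M1 is greater than 0. The default rate rises as the share increases, most noticeably in the 0.1-0.2 range.

The right panel shows the 0/1 distribution of TARGET when the share of historical status M1 is 0; the current default rate is 8.1%.

### IV. Summary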
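1. The default rate falls as borrower age increases. The 20-25 age group has the highest default rate, above 12%.

2. By occupation type, the largest groups are applicants with no stated occupation, laborers, sales staff, core staff and managers. This broadly matches the target customers of home-loan products.

By occupation type, the occupations with a high default rate (above 10%) are waiters/barmen staff, laborers, drivers, cooking staff and security staff. These occupations have relatively low incomes and weaker repayment capacity.

3. The shorter the length of employment, the higher the default rate. Borrowers with short working histories (0-5 years) are the most likely to default. As the years of employment increase and savings build up and stabilise, the default rate falls markedly.

4. The share of defaults gradually falls as annual income rises; the default rate is relatively high in the 100,000-200,000 annual income range.

5. There is no clear, uniform relationship between the credit-to-income ratio and default. Borrowers with a ratio below 2 have a relatively low default rate; between 2 and 6 the default rate is relatively high; above 6 it falls back. Borrowers whose loan amount is far larger than their annual income have a relatively low default rate, which suggests that this group has a higher credit standing and lower default risk and can therefore borrow more.

6. By number of enquiries, most applicants fall between -1 (no enquiry record) and 6, and each of these values covers more than 10,000 applicants. Within this range the default rate first falls and then rises gently: applicants with no enquiry record (-1) are numerous and their default rate is above 10%, while between 0 and 6 enquiries the rate climbs slowly at around 8%.

7. The number of borrowers falls as the number of previous loans rises. Within the range of 0-10 previous loans, the default rate first falls and then rises again; borrowers with 4 previous loans have the lowest default rate.

8. Among borrowers with historical overdue payments, the current default rate is relatively low when the total days overdue is small (roughly 1-10 days); in the 11-60 day range it fluctuates considerably but is flat overall. Among borrowers who repaid on time or have no history, the current default rate is lower (6.7%), which suggests that borrowers who paid attention to repayment dates in the past are unlikely to fall behind on later loans.

9. Among borrowers with historical overdue amounts, the default rate is relatively low when the mean overdue amount is small, and the share of defaults grows as the amount rises. Among borrowers who repaid in full or have no history, the current default rate is lower (8%), which again suggests that borrowers who paid in full in the past are unlikely to fall behind on later loans.

10. When the total count of historical status C is greater than 0, the default rate is relatively high for 0-50 occurrences of C and relatively low for more than 50 occurrences, which suggests that users who were in status C many times are less likely to default. When the total count of status C is 0, the current default rate is 16.4%. When the share of historical status M1 is greater than 0, the default rate rises with the share, most noticeably in the 0.1-0.2 range. When the share of status M1 is 0, the current default rate is 8.1%.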

### 1-5 Basic information about each column

In [None]:
# Separate the text columns from the numeric columns
feat_obj = df_all.dtypes[df_all.dtypes == 'object'].index
feat_num = df_all.dtypes[df_all.dtypes != 'object'].index

#### 1-5-1 Analysing the text columns

In [None]:
# Number of distinct values in each text column (train.select_dtypes('object').apply(pd.Series.nunique, axis = 0) also works)
df_all[feat_obj].apply(pd.Series.nunique, axis = 0)

**The gender column has 3 distinct values, so its value distribution is inspected separately.**

In [None]:
df_all['CODE_GENDER'].value_counts()

The gender column contains the anomalous value 'XNA'; it is replaced with NaN.

In [None]:
df_all['CODE_GENDER'] = df_all['CODE_GENDER'].replace('XNA', np.nan)

**Create dummy columns for the text columns and compute the correlation of the text features with the target.**

In [None]:
feat_obj_dum = pd.get_dummies(df_all[feat_obj], dummy_na = True)
df_all = pd.concat([df_all, feat_obj_dum], axis = 1)
# Drop the original text columns
df_all.drop(feat_obj,axis = 1, inplace = True)

In [None]:
feat_obj_dum['TARGET'] = df_train['TARGET']

**Compute the correlation coefficients between the text columns and the target, and sort by the target column.**

In [None]:
obj_corr = feat_obj_dum.corr()
obj_corr = obj_corr['TARGET']

In [None]:
abs(obj_corr).sort_values(ascending = False).head(10)

**These are the 10 text-derived columns with the highest correlation; occupation type, gender and education type rank near the top.**

In [None]:
del feat_obj_dum
gc.collect()

#### 1-5-2 Analysing the numeric columns

In [None]:
df_all.loc[ : , ['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH']].describe()

**The maximum of DAYS_EMPLOYED is 365243 days, roughly 1000 years, which may be an anomalous value. The column is analysed below.**

In [None]:
df_all['DAYS_EMPLOYED'].plot.hist()
plt.xlabel('DAYS_EMPLOYED')

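**More than 50,000 users have DAYS_EMPLOYED equal to 365243. The next step checks whether these rows should be removed
(by computing the number of anomalous values, the default rate of the anomalous values and the default rate of the normal values).**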

In [None]:
anom = df_train[df_train['DAYS_EMPLOYED'] == 365243]
nom = df_train[df_train['DAYS_EMPLOYED'] != 365243]
prec_anom = 100 * anom['TARGET'].mean()
prec_nom = 100 * nom['TARGET'].mean()

print('number of anomalies :', len(anom))
print('percent of anomalies that default the loans : %0.2f%%' % prec_anom)
print('percent of nomalies that default the loans :  %0.2f%%' % prec_nom)

In [None]:
del anom, nom
gc.collect()

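**The default rate of the anomalous values is lower than that of the normal values. This suggests that 365243 does not mean 1000 years but acts as a marker, so a new column DAYS_EMPLOYED_anom is added to flag it.**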

In [None]:
df_all['DAYS_EMPLOYED_anom'] = df_all['DAYS_EMPLOYED'] == 365243
df_all['DAYS_EMPLOYED'] = df_all['DAYS_EMPLOYED'].replace(365243, np.nan)

In [None]:
anom_dum = pd.get_dummies(df_all['DAYS_EMPLOYED_anom'], prefix = 'DAYS_EMPLOYED_anom', dummy_na = True)
df_all = pd.concat([df_all, anom_dum], axis = 1)
df_all.drop(['DAYS_EMPLOYED_anom'],axis = 1, inplace = True)

In [None]:
del anom_dum
gc.collect()

**Add 5 new columns: DAY_EMPLOYED_PERC, INCOME_CREDIT_PERC, INCOME_PER_PERSON, ANNUITY_INCOME_PERC and PAYMENT_RATE.**

In [None]:
# Ratio of time employed to age
df_all['DAY_EMPLOYED_PERC'] = df_all['DAYS_EMPLOYED'] / df_all['DAYS_BIRTH']
# Ratio of total income to the loan amount
df_all['INCOME_CREDIT_PERC'] = df_all['AMT_INCOME_TOTAL'] / df_all['AMT_CREDIT']
# Income per family member
df_all['INCOME_PER_PERSON'] = df_all['AMT_INCOME_TOTAL'] / df_all['CNT_FAM_MEMBERS']
# Ratio of the loan annuity to total income
df_all['ANNUITY_INCOME_PERC'] = df_all['AMT_ANNUITY'] / df_all['AMT_INCOME_TOTAL']
# Ratio of the loan annuity to the loan amount
df_all['PAYMENT_RATE'] = df_all['AMT_ANNUITY'] / df_all['AMT_CREDIT']

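#### 1-5-3 Handling outliers

**The calculation uses the mild-outlier formula (values beyond 1.5 times the spread outside Q1 and Q3). To remove fewer outliers, Q1 and Q3 are moved from the usual 25% and 75% quantiles to the 2% and 98% quantiles.**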
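
To make the effect of widening the quantiles concrete, here is a minimal sketch on made-up data. The helper name `fence_outlier_index` and its parameters are assumptions for illustration and do not appear in the original notebook. A value is flagged when it lies below `q_low - k * spread` or above `q_high + k * spread`, where `spread = q_high - q_low`.

```python
import numpy as np
import pandas as pd

def fence_outlier_index(series, lower_q=0.02, upper_q=0.98, k=1.5):
    """Hypothetical helper: index labels of values outside the fences."""
    q_low = series.quantile(lower_q)
    q_high = series.quantile(upper_q)
    spread = q_high - q_low
    mask = (series < q_low - k * spread) | (series > q_high + k * spread)
    return series.index[mask]

# Made-up column: 0..97 plus one moderate and one extreme value
values = pd.Series(np.arange(100, dtype=float))
values.iloc[98] = 200.0
values.iloc[99] = 1000.0

wide = fence_outlier_index(values)                 # 2% / 98% quantiles
classic = fence_outlier_index(values, 0.25, 0.75)  # usual quartiles

print('2%/98% fences flag rows:', wide.tolist())
print('25%/75% fences flag rows:', classic.tolist())
```

On this toy column the usual quartiles flag both 200 and 1000, while the widened quantiles flag only 1000, which is the intended "remove fewer rows" behaviour.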

In [None]:
# Computed on the train table; outliers holds the row indexes that are candidates for removal
outlier_indices = []
for i in feat_num:
    Q1 = df_train[i].quantile(0.02)
    Q3 = df_train[i].quantile(0.98)
    IQR = Q3 - Q1
    outliers = df_train[(df_train[i] < Q1 - 1.5 * IQR) | (df_train[i] > Q3 + 1.5 * IQR)].index
    outlier_indices.extend(outliers)

In [None]:
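# The same row index can be flagged in several columns; count the flags per row and keep rows flagged more than twice
from collections import Counter
outlier_indices = Counter(outlier_indices)  # dict-like: row index -> number of columns that flagged it
multiple_outliers = []
for key, values in outlier_indices.items():  # .items() yields (key, value) pairs
    if values > 2:
        multiple_outliers.append(key)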

In [None]:
# Number of rows to drop
print(len(multiple_outliers))
df_all.drop(multiple_outliers, inplace = True)

In [None]:
# Shape of the combined table after removing the outlier rows
df_all.shape

#### 1-5-4 Correlation coefficients of the numeric features

In [None]:
tem = df_train[feat_num].copy()
tem["TARGET"] = df_train["TARGET"]
num_corr = tem.corr()

In [None]:
abs(num_corr['TARGET']).sort_values(ascending = False)

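**'EXT_SOURCE_3', 'EXT_SOURCE_2' and 'EXT_SOURCE_1' have relatively high correlation coefficients; dummy columns are created for their most frequent values.
Among the remaining numeric columns, features with more than 150 distinct values are treated as continuous, and those with 150 or fewer are treated as discrete and converted to dummy columns.**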

In [None]:
num_value_count = df_all[feat_num].apply(pd.Series.nunique, axis = 0)

In [None]:
feat_num_dum = num_value_count[num_value_count <= 150].index.tolist()
feat_num_not_dum = num_value_count[num_value_count > 150].index.tolist()
print('There are %d feature of num need to get dummy.' % len(feat_num_dum))
print('There are %d feature of num left.' % len(feat_num_not_dum))

In [None]:
df_all = pd.get_dummies(df_all, columns = feat_num_dum, dummy_na = True )

In [None]:
df_all.shape

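**Analyse the ext_source data separately. Because there are many missing values, which would affect the plots below, rows with missing values are dropped first.**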

In [None]:
df_train['EXT_SOURCE_1'].value_counts().sort_values(ascending = False)

In [None]:
ext_source = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
df_ext_source = df_all[ext_source]
for i in ext_source:
    print(df_ext_source[i].isnull().sum())
    
df_ext_source = df_ext_source.dropna()
print(df_ext_source.head(20))

In [None]:
plt.figure(figsize = (15, 15))
for i, col in enumerate(ext_source):
    plt.subplot(3, 1, i + 1)
    plt.hist(df_ext_source[col], bins = 5000, color = 'blue')
    plt.title('distribution of %s' % col)
    plt.xlabel('%s' % col)   

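**The plots show that the non-missing values of all 3 features lie between 0 and 1.**

**EXT_SOURCE_1
Every individual value occurs rarely, so no dummy columns are created.**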

In [None]:
# New column EXT_SOURCE_1_null: 1 where EXT_SOURCE_1 is missing, otherwise 0
df_all['EXT_SOURCE_1_null'] = df_all['EXT_SOURCE_1'].isnull().astype(int)

In [None]:
# Check that no missing value was overlooked
df_all['EXT_SOURCE_1_null'].sum()

**EXT_SOURCE_2**

In [None]:
# New column: 1 where EXT_SOURCE_2 is missing, otherwise 0
df_all['EXT_SOURCE_2_null'] = df_all['EXT_SOURCE_2'].isnull().astype(int)

In [None]:
# Check that no missing value was overlooked
df_all['EXT_SOURCE_2_null'].sum()

In [None]:
# Look at the most frequent values
tem = df_all['EXT_SOURCE_2'].value_counts().sort_values(ascending = False)

In [None]:
# 21 values occur more than 100 times
tem[tem > 100].shape[0]

In [None]:
# Create a dummy column for each value that occurs more than 100 times
for i in tem[tem > 100].index:
    df_all['EXT_SOURCE_2' + str(i)] = (df_all['EXT_SOURCE_2'] == i).astype(int)

In [None]:
df_all.shape

**EXT_SOURCE_3**

In [None]:
# New column: 1 where EXT_SOURCE_3 is missing, otherwise 0
df_all['EXT_SOURCE_3_null'] = df_all['EXT_SOURCE_3'].isnull().astype(int)

In [None]:
# Check that no missing value was overlooked
df_all['EXT_SOURCE_3_null'].sum()

In [None]:
# Look at the most frequent values
tem = df_all['EXT_SOURCE_3'].value_counts().sort_values(ascending = False)

In [None]:
# 48 values occur more than 1000 times
tem[tem > 1000].shape[0]

In [None]:
# Create a dummy column for each value that occurs more than 1000 times
for i in tem[tem > 1000].index:
    df_all['EXT_SOURCE_3' + str(i)] = (df_all['EXT_SOURCE_3'] == i).astype(int)

In [None]:
df_all.shape

In [None]:
df_all.fillna(-1, inplace = True)

In [None]:
# Check that all missing values have been filled
df_all.isnull().sum()

## 2. Exploring the data (the remaining six tables)

#### 2-1 Building features from the bureau and bureau_balance tables

In [None]:
bureau = pd.read_csv('../input/bureau.csv')
bb = pd.read_csv('../input/bureau_balance.csv')

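**The categorical columns are one-hot encoded first. A group-by on the encoded columns then gives, for each category, its share of records (mean) and its total number of occurrences (sum). For each SK_ID_BUREAU, the minimum, the maximum, and the number of MONTHS_BALANCE records are computed as well.**

**This pattern of one-hot encoding followed by group-by statistics such as min, max, and mean is used for all the tables below and is not explained again.**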
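
To make the pattern concrete, here is a minimal sketch on made-up data. The column names `loan_id` and `STATUS` and their values are assumptions chosen for illustration; only `pd.get_dummies` and `groupby(...).agg(...)` mirror what the notebook does.

```python
import pandas as pd

# Toy table: several monthly status records per loan (invented data)
toy = pd.DataFrame({
    'loan_id': [1, 1, 1, 2, 2],
    'STATUS': ['C', 'C', 'X', '0', 'C'],
})

# Step 1: one-hot encode the categorical column (one 0/1 column per category)
encoded = pd.get_dummies(toy, columns=['STATUS'], dtype=int)

# Step 2: per loan, the mean of a 0/1 column is the share of records in that
# category, and the sum is the number of records in that category
agg = encoded.groupby('loan_id').agg(['mean', 'sum'])
print(agg)
```

For loan 1, two of the three records have status `C`, so `STATUS_C` has a mean of 2/3 and a sum of 2.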

In [None]:
# One-hot encode a DataFrame; only columns of dtype 'object' are converted.
def one_hot_encoder(df, nan_category = True):
    original_cols = list(df.columns)
    categorial_cols = [col for col in df.columns if df[col].dtypes == 'object']
    df = pd.get_dummies(df, columns = categorial_cols, dummy_na = nan_category)
    new_columns = [i for i in df.columns if i not in original_cols]
    return df, new_columns

In [None]:
bureau.head()

In [None]:
bb.head()

In [None]:
b_obj = bureau.dtypes[bureau.dtypes == 'object'].index
bb_obj = bb.dtypes[bb.dtypes == 'object'].index

In [None]:
bureau[b_obj].apply(pd.Series.nunique, axis = 0)

In [None]:
bb[bb_obj].apply(pd.Series.nunique, axis = 0)

In [None]:
bureau, bureau_cat = one_hot_encoder(bureau)
bb, bb_cat = one_hot_encoder(bb)

In [None]:
bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size'] } # minimum, maximum, number of records
for i in bb_cat:
    bb_aggregations[i] = ['mean','sum'] # share of records and total count per category

bb_agg = bb.groupby('SK_ID_BUREAU').agg(bb_aggregations)
tem = []
for i in bb_agg.columns.tolist():
    tem.append(i[0] + '_' + i[1])
bb_agg.columns = pd.Index(tem)

In [None]:
bb_agg.head()
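
Aggregating with a list of functions produces two-level column labels, which the loop above flattens into names such as `MONTHS_BALANCE_min`. A compact equivalent, shown here on made-up data (the columns `key` and `amount` are hypothetical), is a list comprehension over the label pairs:

```python
import pandas as pd

toy = pd.DataFrame({'key': [1, 1, 2], 'amount': [10.0, 30.0, 5.0]})

# Each resulting column label is a pair such as ('amount', 'min')
agg = toy.groupby('key').agg({'amount': ['min', 'max']})

# Join the two levels with an underscore to get flat column names
agg.columns = ['_'.join(pair) for pair in agg.columns]
print(agg.columns.tolist())
```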

In [None]:
# Join bureau with the aggregated bb table on SK_ID_BUREAU
bureau = bureau.join(bb_agg, how= 'left', on='SK_ID_BUREAU')
bureau.drop(['SK_ID_BUREAU'], axis= 1, inplace= True)

In [None]:
del bb_agg, bb
gc.collect()

In [None]:
bureau.head()

In [None]:
# Finally, aggregate the bureau history of each SK_ID_CURR: min, max, mean, variance, and sum of every column
tem = bureau.columns.tolist()
num_agg = { }
for i in tem:
    if i != 'SK_ID_CURR':
        num_agg[i] = ['min','max','mean','var','sum']
bureau_agg = bureau.groupby('SK_ID_CURR', as_index = False).agg(num_agg)

tem = [ ]
tem.append('SK_ID_CURR')
for i in bureau_agg.columns.tolist():
    if i[0] != 'SK_ID_CURR':
        tem.append('bureau' + '_' + i[0] + '_' + i[1])
bureau_agg.columns = pd.Index(tem)

In [None]:
bureau_agg.head()

In [None]:
# Merge into df_all; the result is called df_allX
df_allX = df_all.merge(bureau_agg, how = 'left', on= 'SK_ID_CURR')

In [None]:
df_allX.shape

In [None]:
del bureau, bureau_agg
gc.collect()

#### 2-2 Building features from the installments_payments table

In [None]:
ip = pd.read_csv('../input/installments_payments.csv')

##### 2-2-1 Days past due per instalment: which customers paid late, and how often

In [None]:
ip.head()

In [None]:
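# Add a column DEFAULT_DAY: the number of days each instalment was paid late
ip['DEFAULT_DAY'] = ip['DAYS_ENTRY_PAYMENT'] - ip['DAYS_INSTALMENT']
tem = ip[ip['DEFAULT_DAY'] > 0] # keep only instalments paid after the due date
tem = tem.loc[: ,['SK_ID_CURR','DEFAULT_DAY']]
# Total days past due and number of late payments per SK_ID_CURR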
default_days_agg = tem.groupby('SK_ID_CURR', as_index = False).agg({ 'DEFAULT_DAY' : [ 'sum', 'count']})

print(default_days_agg.head())

In [None]:
tem = []
tem.append('SK_ID_CURR')
for i in default_days_agg.columns.tolist():
    if i[0] != 'SK_ID_CURR':
        tem.append('ip' + '_' + i[0] + '_' + i[1])
default_days_agg.columns = pd.Index(tem)

print(default_days_agg.head())
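
A small worked example may help with the sign convention. It assumes, as the code above does, that both day columns are offsets relative to the application date, so a payment made after its due date gives a positive difference. The data below is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'SK_ID_CURR': [100, 100, 100, 200],
    'DAYS_INSTALMENT': [-30, -60, -90, -10],     # due date
    'DAYS_ENTRY_PAYMENT': [-25, -60, -95, -12],  # actual payment date
})

# Positive = paid late, zero = paid on time, negative = paid early
toy['DEFAULT_DAY'] = toy['DAYS_ENTRY_PAYMENT'] - toy['DAYS_INSTALMENT']

late = toy[toy['DEFAULT_DAY'] > 0]
summary = late.groupby('SK_ID_CURR')['DEFAULT_DAY'].agg(['sum', 'count'])
print(summary)
```

Customer 200 never paid late and therefore does not appear in `summary`. After a left merge such customers get NaN in these columns, which the notebook later fills with 0.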

In [None]:
df_allX = df_allX.merge(default_days_agg, how = 'left', on = 'SK_ID_CURR')

In [None]:
del default_days_agg
gc.collect()

##### 2-2-2 Number of previous loans and total number of instalments per customer

In [None]:
ip['MONEY'] = ip['AMT_INSTALMENT'] - ip['AMT_PAYMENT']
tem = ip.loc[: ,['SK_ID_PREV', 'SK_ID_CURR', 'MONEY']]

In [None]:
tem = tem.groupby(['SK_ID_CURR','SK_ID_PREV'], as_index = False).count().rename(columns={'MONEY': 'TIMES'})

In [None]:
tem = tem.groupby('SK_ID_CURR', as_index = False).agg({'SK_ID_PREV': 'count', 'TIMES': 'sum'})

In [None]:
tem = tem.rename(index = str, columns = {"SK_ID_PREV": "ip_CREDIT_TIMES", "TIMES": "ip_TOTAL_INSTALLMENT_TIMES"})

In [None]:
tem.head()

In [None]:
df_allX = df_allX.merge(tem,how = 'left', on = 'SK_ID_CURR')

In [None]:
df_allX.shape

##### 2-2-3 Unpaid instalment amount, summed per customer

In [None]:
tem = ip.groupby('SK_ID_CURR', as_index = False).agg({'AMT_INSTALMENT':'sum', 'AMT_PAYMENT':'sum'})
tem['ip_default_money'] = tem['AMT_INSTALMENT'] - tem['AMT_PAYMENT']
tem = tem.loc[:,['SK_ID_CURR','ip_default_money']]

In [None]:
df_allX = df_allX.merge(tem, how = 'left', on= 'SK_ID_CURR')

In [None]:
del ip, tem
gc.collect()

#### 2-3 Building features from the previous_application table

In [None]:
pa = pd.read_csv('../input/previous_application.csv')

In [None]:
pa.head()

In [None]:
pa, pa_cat = one_hot_encoder(pa)

In [None]:
# New feature: ratio of the amount applied for to the amount actually granted
pa['application_credit_perc'] = pa['AMT_APPLICATION']/ pa['AMT_CREDIT']
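
One caveat with a ratio feature like this: if the denominator contains zeros, the division yields `inf` (or NaN for 0/0), and `fillna` does not touch `inf`. Whether AMT_CREDIT actually contains zeros is not checked in this notebook, so the following is only a precautionary sketch on made-up values, with `safe_ratio` as a hypothetical name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'AMT_APPLICATION': [1000.0, 500.0, 0.0],
    'AMT_CREDIT': [800.0, 0.0, 0.0],
})

ratio = toy['AMT_APPLICATION'] / toy['AMT_CREDIT']

# Turn +/-inf into NaN so that later missing-value handling covers these rows
safe_ratio = ratio.replace([np.inf, -np.inf], np.nan)
print(safe_ratio)
```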

In [None]:
cat_agg = { }
for i in pa_cat:
    cat_agg[i] = ['mean', 'sum']
    
num_agg = {
    'AMT_ANNUITY': ['min','max','mean'],
    'AMT_APPLICATION':['min','max','mean'],
    'AMT_CREDIT':['min','max','mean'],
    'AMT_DOWN_PAYMENT':['min','max','mean'],
    'AMT_GOODS_PRICE':['min','max','mean'],
    'application_credit_perc':['min','max','mean','var'],
    'HOUR_APPR_PROCESS_START':['min','max','mean'],
    'RATE_DOWN_PAYMENT':['min','max','mean'],
    'DAYS_DECISION':['min','max','mean'],
    'CNT_PAYMENT':['sum','mean']
}

pa_agg = pa.groupby('SK_ID_CURR', as_index = False).agg({**cat_agg, **num_agg})

In [None]:
pa_agg.head()

In [None]:
tem = []
tem.append('SK_ID_CURR')
for i in pa_agg.columns.tolist():
    if i[0] != 'SK_ID_CURR':
        tem.append(i[0] + '_' + i[1])
pa_agg.columns = pd.Index(tem)

In [None]:
df_allX = df_allX.merge(pa_agg, how = 'left', on = 'SK_ID_CURR')

In [None]:
df_allX.head()

In [None]:
del pa_agg, pa
gc.collect()

#### 2-4 Building features from the POS_CASH_balance table

In [None]:
pcb = pd.read_csv('../input/POS_CASH_balance.csv')

**Add a STATUS_COMPLETED column: 1 if NAME_CONTRACT_STATUS is 'Completed', 0 for any other status.**

In [None]:
# Vectorized comparison; a row-by-row Python loop is very slow on a table of this size
pcb['STATUS_COMPLETED'] = (pcb['NAME_CONTRACT_STATUS'] == 'Completed').astype(int)

In [None]:
pcb, pcb_cat = one_hot_encoder(pcb)

In [None]:
pcb_cat_agg = { }
for i in pcb_cat:
    pcb_cat_agg[i] = ['mean','sum']

pcb_num_agg = { 
    'MONTHS_BALANCE': ['max', 'mean', 'size'],
    'SK_DPD': ['max', 'mean'],
    'SK_DPD_DEF': ['max', 'mean'],
    'STATUS_COMPLETED':['sum']
}

In [None]:
pcb_agg = pcb.groupby('SK_ID_CURR', as_index = False).agg({**pcb_cat_agg, **pcb_num_agg})

In [None]:
tem = [ ]
tem.append('SK_ID_CURR')
for i in pcb_agg.columns.tolist():
    if i[0] != 'SK_ID_CURR':
        tem.append('pcb' + '_' + i[0] + '_' + i[1])
pcb_agg.columns = pd.Index(tem)

In [None]:
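# New column: number of records per customer.
# pcb_agg has a plain RangeIndex, so the counts are looked up by SK_ID_CURR instead of being assigned by index.
pcb_agg['pcb_count'] = pcb_agg['SK_ID_CURR'].map(pcb.groupby('SK_ID_CURR').size())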

In [None]:
pcb_agg.head()
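
A point worth checking when attaching per-group counts to a frame built with `as_index = False`: pandas aligns an assigned Series on index labels, not on position. The sketch below, on invented data, shows how a direct assignment silently misaligns and how a lookup by key avoids that.

```python
import pandas as pd

toy = pd.DataFrame({'SK_ID_CURR': [100, 100, 200], 'x': [1.0, 2.0, 3.0]})

# as_index=False keeps SK_ID_CURR as a column; the index is 0, 1
agg = toy.groupby('SK_ID_CURR', as_index=False).agg({'x': 'mean'})

# size() returns a Series indexed by SK_ID_CURR (100, 200)
sizes = toy.groupby('SK_ID_CURR').size()

misaligned = agg.copy()
misaligned['n'] = sizes      # index labels 0, 1 do not exist in sizes -> NaN

aligned = agg.copy()
aligned['n'] = aligned['SK_ID_CURR'].map(sizes)  # look up each id explicitly

print(misaligned)
print(aligned)
```

On real data the effect can be worse than NaN: wherever a row number happens to coincide with an existing SK_ID_CURR, the row receives the count of an unrelated customer.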

**Features from STATUS_COMPLETED: the number of previous loans per SK_ID_CURR (pcb_PREV_CREDIT_COUNT) and the share of 'Completed' entries (pcb_COMPLETED_PERC).**

In [None]:
pcb.head()

In [None]:
total_completed = pcb.loc[ : , ['SK_ID_CURR', 'STATUS_COMPLETED']].groupby('SK_ID_CURR', as_index = False).sum()
# nunique() counts distinct previous loans; count() would count monthly records instead
credt_count = pcb.groupby('SK_ID_CURR')['SK_ID_PREV'].nunique().rename('pcb_PREV_CREDIT_COUNT').reset_index()
tem = total_completed.merge(credt_count, how = 'left', on = 'SK_ID_CURR')
tem['pcb_COMPLETED_PERC'] = tem['STATUS_COMPLETED'] / tem['pcb_PREV_CREDIT_COUNT']
print(tem.head())

In [None]:
pcb_agg = pcb_agg.merge(tem, how = 'left', on = 'SK_ID_CURR')

In [None]:
pcb_agg.head()

In [None]:
df_allX = df_allX.merge(pcb_agg, how = 'left', on= 'SK_ID_CURR')

In [None]:
df_allX.head()

In [None]:
del pcb, pcb_agg
gc.collect()

#### 2-5 Building features from the credit_card_balance table

In [None]:
ccb = pd.read_csv('../input/credit_card_balance.csv')

**Add a STATUS_COMPLETED column: 1 if NAME_CONTRACT_STATUS is 'Completed', 0 for any other status.**

In [None]:
# Vectorized comparison; a row-by-row Python loop is very slow on a table of this size
ccb['STATUS_COMPLETED'] = (ccb['NAME_CONTRACT_STATUS'] == 'Completed').astype(int)

In [None]:
ccb, ccb_cat = one_hot_encoder(ccb)

In [None]:
ccb_cat_agg = { }
for i in ccb_cat:
    ccb_cat_agg[i] = ['mean','sum']
    
ccb_num_agg = { 
    'MONTHS_BALANCE': ['max', 'mean', 'size'],
    'SK_DPD': ['max', 'mean'],
    'SK_DPD_DEF': ['max', 'mean'],
    'STATUS_COMPLETED':['sum']
}

ccb_num = ['AMT_BALANCE',
       'AMT_CREDIT_LIMIT_ACTUAL', 'AMT_DRAWINGS_ATM_CURRENT',
       'AMT_DRAWINGS_CURRENT', 'AMT_DRAWINGS_OTHER_CURRENT',
       'AMT_DRAWINGS_POS_CURRENT', 'AMT_INST_MIN_REGULARITY',
       'AMT_PAYMENT_CURRENT', 'AMT_PAYMENT_TOTAL_CURRENT',
       'AMT_RECEIVABLE_PRINCIPAL', 'AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE',
       'CNT_DRAWINGS_ATM_CURRENT', 'CNT_DRAWINGS_CURRENT',
       'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_POS_CURRENT',
       'CNT_INSTALMENT_MATURE_CUM']
for i in ccb_num:
    ccb_num_agg[i] = ['min', 'max', 'mean', 'sum', 'var']

In [None]:
ccb_agg = ccb.groupby('SK_ID_CURR', as_index = False).agg({**ccb_cat_agg, **ccb_num_agg})

In [None]:
tem = [ ]
tem.append('SK_ID_CURR')
for i in ccb_agg.columns.tolist():
    if i[0] != 'SK_ID_CURR':
        tem.append('ccb' + '_' + i[0] + '_' + i[1])
ccb_agg.columns = pd.Index(tem)

In [None]:
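# Number of records per customer, looked up by SK_ID_CURR because ccb_agg has a plain RangeIndex
ccb_agg['ccb_count'] = ccb_agg['SK_ID_CURR'].map(ccb.groupby('SK_ID_CURR').size())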

**Features from STATUS_COMPLETED: the number of loans per SK_ID_CURR (ccb_PREV_CREDIT_COUNT) and the share of 'Completed' entries (ccb_COMPLETED_PERC).**

In [None]:
total_completed = ccb.loc[ : , ['SK_ID_CURR', 'STATUS_COMPLETED']].groupby('SK_ID_CURR', as_index = False).sum()
# nunique() counts distinct previous loans; count() would count monthly records instead
credt_count = ccb.groupby('SK_ID_CURR')['SK_ID_PREV'].nunique().rename('ccb_PREV_CREDIT_COUNT').reset_index()
tem = total_completed.merge(credt_count, how = 'left', on = 'SK_ID_CURR')
tem['ccb_COMPLETED_PERC'] = tem['STATUS_COMPLETED'] / tem['ccb_PREV_CREDIT_COUNT']
print(tem.head())

In [None]:
ccb_agg = ccb_agg.merge(tem, how = 'left',on = 'SK_ID_CURR')

In [None]:
ccb_agg.head()

In [None]:
df_allX = df_allX.merge(ccb_agg,how = 'left', on= 'SK_ID_CURR')

In [None]:
df_allX.head()

In [None]:
del ccb, ccb_agg, total_completed, credt_count ,tem
gc.collect()

#### 2-6 Filling missing values

In [None]:
df_allX.fillna(0, inplace = True)

In [None]:
# Take the TARGET column after dropping the outlier rows
df_target_fin = df_train.drop(multiple_outliers)[['TARGET']].copy()

In [None]:
print(df_target_fin.shape)

del df_train
gc.collect()

At this point **df_allX** has **353683** rows, **df_test** has **48744** rows, and **df_train** (after removing the outlier rows) has **304939** rows.

In [None]:
df_allX.drop('SK_ID_CURR',axis = 1, inplace = True )
df_train_fin = df_allX.loc[ : df_target_fin.shape[0]-1, : ].copy()
df_test_fin = df_allX.loc[df_target_fin.shape[0] : , : ].copy()
print(df_train_fin.shape,df_test_fin.shape)

del df_allX
gc.collect()
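
The split above relies on the stacked table keeping the train rows first and the test rows after them, with an unbroken 0..n-1 index. A position-based variant with `iloc` makes that assumption explicit and does not depend on index labels. This is a sketch on toy data, not the notebook's own code.

```python
import pandas as pd

train = pd.DataFrame({'f': [1, 2, 3]})
test = pd.DataFrame({'f': [4, 5]})

# Stack train on top of test, as the notebook does when building df_all
stacked = pd.concat([train, test]).reset_index(drop=True)

n_train = len(train)
train_part = stacked.iloc[:n_train].copy()   # first n_train rows
test_part = stacked.iloc[n_train:].copy()    # everything after them

print(train_part.shape, test_part.shape)
```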

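## 3. Generating the tables needed for modeling

**Memory ran out before the model could be trained in the same session, so df_train_fin, df_test_fin, and df_target_fin are saved as CSV files and serve as the input for modeling in a second notebook.**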

df_train_fin.to_csv('df_train_fin.csv',encoding = 'utf-8',index = 0)
df_test_fin.to_csv('df_test_fin.csv',encoding = 'utf-8',index = 0)
df_target_fin.to_csv('df_target_fin.csv',encoding = 'utf-8',index = 0)

## 4. Training the model, validating it locally, and predicting on the test data

**Four tables are needed: df_test (the original test table), df_train_fin, df_test_fin, and df_target_fin.**

X = df_train_fin.values
y = df_target_fin['TARGET'].values

del df_train_fin, df_target_fin
gc.collect()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2 , random_state = 5)
del X, y
gc.collect()

model = xgboost.XGBClassifier(n_estimators = 200, max_depth = 8, subsample = 0.8, colsample_bytree = 0.8, 
                              min_child_weight = 50, random_state=27).fit(X_train, y_train)

# Predict on the hold-out 20% of the training data and compute the validation AUC
ans = model.predict_proba(X_test)
roc_auc_score(y_test, ans[ : , 1]) 
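
For intuition about the metric: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function below, `pairwise_auc`, is a hypothetical pure-Python illustration of that definition on four made-up points. It is far too slow for real data, where `roc_auc_score` should be used.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (invented)
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```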

# Predict on the test data
ans_test = model.predict_proba(df_test_fin.values)
print(ans_test)
# Build the submission file. SK_ID_CURR was dropped from df_test_fin earlier,
# so the ids are taken from the original test table.
df_new = pd.DataFrame({'SK_ID_CURR': df_test['SK_ID_CURR'].values, 'TARGET': ans_test[:, 1]})
df_new.to_csv('05.csv',encoding = 'utf-8',index = 0)

## 5. Final score

**Private Score:** 0.78042. **Public Score:** 0.78267.