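# 1. Problem Background
Banks play a crucial role in market economies: they decide who gets financing and on what terms, and they can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. Credit scoring algorithms, which estimate the probability of default, are what banks use to decide whether a loan should be granted. The goal here is to improve on the state of the art in credit scoring by training a model that predicts the probability that a borrower will experience financial distress within the next two years.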

# 2. Getting the Data
**Imports**

In [None]:
# Data wrangling and analysis
import pandas as pd
import numpy as np
import random as rnd
import os, datetime, sys, random, time
import seaborn as sns
import xgboost as xgs
import lightgbm as lgb
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
plt.style.use('fivethirtyeight')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from scipy import stats, special
import shap

**Load the data**

In [None]:
train_df = pd.read_csv('../input/GiveMeSomeCredit/cs-training.csv')
test_df = pd.read_csv('../input/GiveMeSomeCredit/cs-test.csv')
train_data = train_df.copy()
test_data = test_df.copy()

# 3. Exploratory Data Analysis (EDA)
## 3.1 Descriptive Analysis

**Features in the dataset**

In [None]:
print(train_df.columns.values)

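| Field | Meaning | Values |
| :-----| ----: | :----: |
| SeriousDlqin2yrs | Whether the customer is good or bad | {0,1} |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit divided by the sum of credit limits | percentage |
| age | Age of the borrower at the time of borrowing | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times the borrower has been 30-59 days past due, but no worse, in the last 2 years | integer |
| DebtRatio | Monthly debt payments, alimony and living costs divided by monthly gross income | percentage |
| MonthlyIncome | Monthly income | real |
| NumberOfOpenCreditLinesAndLoans | Number of open loans and lines of credit | integer |
| NumberOfTimes90DaysLate | Number of times the borrower has been 90 days or more past due | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times the borrower has been 60-89 days past due, but no worse, in the last 2 years | integer |
| NumberOfDependents | Number of dependents in the family, excluding the borrower | integer |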

In [None]:
train_df.describe()

In [None]:
test_df.describe()

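As the output above shows, the `Unnamed: 0` column is just the customer ID. It carries no information for the model, so we drop it (keeping a copy of the IDs for later).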

In [None]:
trainID = train_data['Unnamed: 0']
testID = test_data['Unnamed: 0']
train_data.drop('Unnamed: 0', axis=1, inplace=True)
test_data.drop('Unnamed: 0', axis=1, inplace=True)

**Classify the features by the values they take**

| Categorical features | Numerical features |
| - | :-: |
| SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines |
| | age |
| | NumberOfTime30-59DaysPastDueNotWorse |
| | DebtRatio |
| | MonthlyIncome |
| | NumberOfOpenCreditLinesAndLoans |
| | NumberOfTimes90DaysLate |
| | NumberRealEstateLoansOrLines |
| | NumberOfTime60-89DaysPastDueNotWorse |
| | NumberOfDependents |


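**Check the features for missing values**

The following columns contain missing values, which are handled in a later step:

* Training set: MonthlyIncome, NumberOfDependents
* Test set: MonthlyIncome, NumberOfDependents

In the test set, SeriousDlqin2yrs is the value to be predicted, so that column is dropped.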

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

In [None]:
test_data.drop('SeriousDlqin2yrs', axis=1, inplace = True)

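**Check the class imbalance**

With 150,000 rows of credit data, the dataset is likely to be imbalanced, so we check the ratio of good customers to bad customers.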

In [None]:
fig, axes = plt.subplots(1,2,figsize=(12,6))
train_data['SeriousDlqin2yrs'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=axes[0])
axes[0].set_title('SeriousDlqin2yrs')
sns.countplot(x='SeriousDlqin2yrs', data=train_data, ax=axes[1])  # pass x by keyword; recent seaborn versions reject it positionally
axes[1].set_title('SeriousDlqin2yrs')
plt.show()

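As the charts show, the dataset is highly imbalanced. Such data pushes a supervised learning algorithm to learn mostly from the majority class, which degrades classification performance.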
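
The imbalance can also be read off as numbers rather than from the charts. The following is a minimal sketch on toy labels; the helper name `class_balance` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class, most frequent first."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})

# Toy labels standing in for train_data['SeriousDlqin2yrs'] (illustrative only).
toy_labels = pd.Series([0] * 14 + [1], name="SeriousDlqin2yrs")
balance = class_balance(toy_labels)

# Majority count divided by minority count; a common starting point
# when choosing class weights for an imbalanced classifier.
imbalance_ratio = balance["count"].max() / balance["count"].min()

print(balance)
print("imbalance ratio:", imbalance_ratio)
```

On the real data, `class_balance(train_data['SeriousDlqin2yrs'])` would give the same shares as the pie chart.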

**Outlier analysis**

In [None]:
fig = plt.figure(figsize=[20,140])
for col,i in zip(train_data.columns,range(1,12)):
    axes = fig.add_subplot(14,1,i)
    sns.regplot(x=train_data[col], y=train_data['SeriousDlqin2yrs'], ax=axes)  # x and y must be keywords in recent seaborn versions
plt.show()

The plots above show that:

* NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate all contain counts above 90

* DebtRatio and RevolvingUtilizationOfUnsecuredLines contain unreasonably large values

Next, look at the box plots of NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate.

In [None]:
plt.figure(figsize=(19, 12)) 
train_data[['NumberOfTime30-59DaysPastDueNotWorse']].boxplot()
plt.show()

In [None]:
plt.figure(figsize=(19, 12)) 
train_data[['NumberOfTime60-89DaysPastDueNotWorse']].boxplot()
plt.show()

In [None]:
plt.figure(figsize=(19, 12)) 
train_data[['NumberOfTimes90DaysLate']].boxplot()
plt.show()

The box plots lead to the same conclusion, so the outliers need to be handled next.

(1) Handle NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate

In [None]:
print("Unique values in '30-59 Days' values that are more than or equal to 90:",np.unique(train_data[train_data['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                          ['NumberOfTime30-59DaysPastDueNotWorse']))

print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(train_data[train_data['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                                       ['NumberOfTime60-89DaysPastDueNotWorse']))

print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(train_data[train_data['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                                    ['NumberOfTimes90DaysLate']))

print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",np.unique(train_data[train_data['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                           ['NumberOfTime60-89DaysPastDueNotWorse']))

print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",np.unique(train_data[train_data['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                        ['NumberOfTimes90DaysLate']))

print("Proportion of positive class with special 96/98 values:",
      round(train_data[train_data['NumberOfTime30-59DaysPastDueNotWorse']>=90]['SeriousDlqin2yrs'].sum()*100/
      len(train_data[train_data['NumberOfTime30-59DaysPastDueNotWorse']>=90]['SeriousDlqin2yrs']),2),'%')

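The output above shows that whenever 'NumberOfTime30-59DaysPastDueNotWorse' is 90 or more, the other two past-due columns carry the same value. We treat these rows as a special group because their share of label 1 is unusually high, at 54.65%. The values 96 and 98 can be regarded as recording errors, so we replace them with the largest value below 96 in each column: 13, 11 and 17.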

In [None]:
train_data.loc[train_data['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 13
train_data.loc[train_data['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 11
train_data.loc[train_data['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 17
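
The replacement values 13, 11 and 17 are copied by hand from the printed output. A minimal sketch of deriving them from the data instead, assuming every value at or above 90 is a recording error; the helper `cap_sentinel_codes`, its `caps` argument and the toy column names are hypothetical and not part of the notebook.

```python
import pandas as pd

def cap_sentinel_codes(df, columns, threshold=90, caps=None):
    """Replace values >= threshold with the largest value below it.

    If `caps` is given (for example, learned on another dataset), those
    values are reused instead of being recomputed from `df`.
    Returns the cleaned copy and the cap used for each column.
    """
    out = df.copy()
    used = {}
    for col in columns:
        if caps is not None:
            cap = caps[col]
        else:
            cap = out.loc[out[col] < threshold, col].max()
        out.loc[out[col] >= threshold, col] = cap
        used[col] = cap
    return out, used

# Toy frame with 96/98 codes mixed into otherwise small counts.
toy = pd.DataFrame({
    "late_30_59": [0, 1, 13, 96, 98],
    "late_90": [0, 2, 17, 96, 98],
})
capped, caps = cap_sentinel_codes(toy, ["late_30_59", "late_90"])
print(caps)
print(capped)
```

Passing the returned `caps` when cleaning a second frame would apply identical replacement values to both, which is a design choice rather than what the notebook does.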

In [None]:
print("Unique values in 30-59Days", np.unique(train_data['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(train_data['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(train_data['NumberOfTimes90DaysLate']))

The steps above cleaned the training set. Next, apply the same treatment to the test set.

In [None]:
print("Unique values in '30-59 Days' values that are more than or equal to 90:",np.unique(test_data[test_data['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                          ['NumberOfTime30-59DaysPastDueNotWorse']))


print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(test_data[test_data['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                                       ['NumberOfTime60-89DaysPastDueNotWorse']))


print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(test_data[test_data['NumberOfTime30-59DaysPastDueNotWorse']>=90]
                                                                                                    ['NumberOfTimes90DaysLate']))


print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",np.unique(test_data[test_data['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                           ['NumberOfTime60-89DaysPastDueNotWorse']))


print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",np.unique(test_data[test_data['NumberOfTime30-59DaysPastDueNotWorse']<90]
                                                                                        ['NumberOfTimes90DaysLate']))

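As before, the values 96 and 98 are regarded as recording errors and replaced with the largest value below 96 in each test-set column: 19, 9 and 18.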

In [None]:
test_data.loc[test_data['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 19
test_data.loc[test_data['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 9
test_data.loc[test_data['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 18

print("Unique values in 30-59Days", np.unique(test_data['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(test_data['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(test_data['NumberOfTimes90DaysLate']))

Redraw the box plots of NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate for the training set and the test set.

In [None]:
plt.figure(figsize=(19, 12)) 
train_data[['NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate']].boxplot()
plt.show()

In [None]:
plt.figure(figsize=(19, 12)) 
test_data[['NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate']].boxplot()
plt.show()

(2) Adjust DebtRatio and RevolvingUtilizationOfUnsecuredLines

Start with DebtRatio.

In [None]:
print('Debt Ratio: \n',train_data['DebtRatio'].describe())
print('\nRevolving Utilization of Unsecured Lines: \n',train_data['RevolvingUtilizationOfUnsecuredLines'].describe())

The summary shows a huge gap between the third quartile and the maximum, so we look at the upper tail in more detail.

In [None]:
quantiles = [0.75,0.8,0.81,0.85,0.9,0.95,0.975,0.99]
for i in quantiles:
    print(i*100,'% quantile of debt ratio is: ',train_data.DebtRatio.quantile(i))

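The output shows that DebtRatio jumps sharply after the 81st percentile, so the main goal now is to analyze and handle the outliers in this upper tail. The next cell inspects the rows at or above the 95th percentile.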
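
One way to locate such a jump without reading the printed list by eye is to compare each quantile with the previous one. This is a rough sketch on toy data; the helper name `quantile_growth` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def quantile_growth(series: pd.Series, quantiles) -> pd.DataFrame:
    """Tabulate quantile values and their ratio to the previous quantile."""
    q = series.quantile(quantiles)
    return pd.DataFrame({"value": q, "ratio_to_previous": q / q.shift(1)})

# Toy stand-in for train_data['DebtRatio']: mostly small ratios, a few huge ones.
toy_debt_ratio = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 500.0, 2000.0])
table = quantile_growth(toy_debt_ratio, [0.5, 0.75, 0.9, 0.99])

print(table)
# The quantile with the largest ratio marks where the tail takes off.
print("largest jump at quantile:", table["ratio_to_previous"].idxmax())
```

The ratio is only meaningful when the quantile values are positive, which holds for this toy series.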

In [None]:
train_data[train_data['DebtRatio'] >= train_data['DebtRatio'].quantile(0.95)][['SeriousDlqin2yrs','MonthlyIncome']].describe()

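The table above shows that:

* Of the 7,501 customers whose DebtRatio is at or above the 95th percentile, only 379 have a MonthlyIncome value

* For these customers MonthlyIncome has a maximum of 1 and a minimum of 0, which suggests a data error. Next, check whether MonthlyIncome equals the SeriousDlqin2yrs label in these rows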

In [None]:
train_data[(train_data["DebtRatio"] > train_data["DebtRatio"].quantile(0.95)) & (train_data['SeriousDlqin2yrs'] == train_data['MonthlyIncome'])]


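The table shows that in 331 of the 379 rows MonthlyIncome equals the SeriousDlqin2yrs label. We therefore drop these 331 outliers: invalid records produced by data-entry errors are of no use for predictive modeling and would add both bias and variance.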

In [None]:
train_data = train_data[~((train_data["DebtRatio"] > train_data["DebtRatio"].quantile(0.95)) & (train_data['SeriousDlqin2yrs'] == train_data['MonthlyIncome']))]  # ~ is the boolean negation operator
train_data
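
The filter above combines two conditions and then negates the result. A small sketch on toy data makes each step visible; the toy values and the 0.5 quantile are chosen only so the example is easy to follow (the notebook uses the 0.95 quantile).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DebtRatio": [0.2, 0.5, 900.0, 1200.0, 3000.0],
    "MonthlyIncome": [5000.0, np.nan, 3000.0, 1.0, np.nan],
    "SeriousDlqin2yrs": [0, 0, 0, 1, 0],
})

# Condition 1: the row sits in the upper DebtRatio tail.
in_tail = toy["DebtRatio"] > toy["DebtRatio"].quantile(0.5)
# Condition 2: income equals the 0/1 label. NaN never compares equal,
# so rows with a missing income are not flagged here.
income_equals_label = toy["MonthlyIncome"] == toy["SeriousDlqin2yrs"]

suspicious = in_tail & income_equals_label
cleaned = toy[~suspicious]  # keep every row that is not flagged

print("rows flagged:", int(suspicious.sum()))
print(cleaned)
```

Naming the masks separately also makes it easy to count how many rows each condition affects before anything is dropped.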

Next, analyze RevolvingUtilizationOfUnsecuredLines.

Look at the rows where RevolvingUtilizationOfUnsecuredLines is greater than 10.

In [None]:
train_data[train_data['RevolvingUtilizationOfUnsecuredLines']>10].describe()


Look at the rows where RevolvingUtilizationOfUnsecuredLines is greater than 13.

In [None]:
train_data[train_data['RevolvingUtilizationOfUnsecuredLines']>13].describe()

The 238 rows summarized above would add considerable error to the final predictions, so we remove them from the dataset.

In [None]:
train_data = train_data[train_data['RevolvingUtilizationOfUnsecuredLines']<=13]
train_data

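**Handling missing values**

* MonthlyIncome is numeric but has missing values, so we fill them with the median
* Missing NumberOfDependents values are filled with 0, on the assumption that a missing entry means the borrower has no dependents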

In [None]:
def NullController(df):
    DataMissing = df.isnull().sum()*100/len(df)
    DataMissingByColumn = pd.DataFrame({'Percentage Nulls':DataMissing})
    DataMissingByColumn.sort_values(by='Percentage Nulls',ascending=False,inplace=True)
    return DataMissingByColumn[DataMissingByColumn['Percentage Nulls']>0]

NullController(train_data)

MonthlyIncome and NumberOfDependents are missing in 19.76% and 2.61% of rows respectively.

Fill the missing values as planned above.

In [None]:
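# Assign the result back instead of using inplace=True on a column selection,
# which may not update the frame in recent pandas versions
train_data['MonthlyIncome'] = train_data['MonthlyIncome'].fillna(train_data['MonthlyIncome'].median())
train_data['NumberOfDependents'] = train_data['NumberOfDependents'].fillna(0)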

Check for missing values again.

In [None]:
NullController(train_data)

Do the same for the test set.

In [None]:
NullController(test_data)

MonthlyIncome and NumberOfDependents are missing in 19.8% and 2.59% of rows respectively.

In [None]:
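# Assign the result back instead of using inplace=True on a column selection
test_data['MonthlyIncome'] = test_data['MonthlyIncome'].fillna(test_data['MonthlyIncome'].median())
test_data['NumberOfDependents'] = test_data['NumberOfDependents'].fillna(0)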
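
The cells above fill each set with its own median. A common alternative, sketched here as an assumption rather than as the notebook's method, is to compute the fill values once on the training set and reuse them for the test set, so both sets receive identical treatment. The helper names `fit_fill_values` and `apply_fill_values` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn fill values from the training data only."""
    return {
        "MonthlyIncome": train["MonthlyIncome"].median(),
        "NumberOfDependents": 0,  # assumption: missing means no dependents
    }

def apply_fill_values(df: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Return a copy of df with NaNs replaced column by column."""
    return df.fillna(value=fill_values)

toy_train = pd.DataFrame({
    "MonthlyIncome": [2000.0, 4000.0, np.nan, 9000.0],
    "NumberOfDependents": [0.0, np.nan, 2.0, 1.0],
})
toy_test = pd.DataFrame({
    "MonthlyIncome": [np.nan, 100000.0],
    "NumberOfDependents": [np.nan, 3.0],
})

fill_values = fit_fill_values(toy_train)
train_filled = apply_fill_values(toy_train, fill_values)
test_filled = apply_fill_values(toy_test, fill_values)

print(fill_values)
print(test_filled)
```

In this sketch the test set's missing income is filled with the training median, so an extreme test value such as 100000 does not influence the fill value.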

Check for missing values again.

In [None]:
NullController(test_data)

In [None]:
print(train_data.shape)
print(test_data.shape)

**Correlation**

In [None]:
corr = train_data.corr()
plt.figure(figsize=(19, 15))
sns.heatmap(corr, annot=True, fmt='.2g')

**Skewness check and Box-Cox transformation**

First, look at the distribution of each feature.

In [None]:
labels = train_data['SeriousDlqin2yrs']
train_data.drop('SeriousDlqin2yrs', axis = 1 , inplace = True)
all_data = pd.concat([train_data, test_data])
all_data.shape

In [None]:
columnList = list(all_data.columns)
columnList

fig = plt.figure(figsize=[20,20])
for col,i in zip(columnList,range(1,19)):
    axes = fig.add_subplot(6,3,i)
    sns.distplot(all_data[col],ax=axes, kde_kws={'bw':1.5}, color='purple')
plt.show()

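The plots show that, apart from age, which is close to normally distributed, the features are skewed in one direction or the other. Next, check the skewness of each column.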

In [None]:
def SkewMeasure(df):
    nonObjectColList = df.dtypes[df.dtypes != 'object'].index
    skewM = df[nonObjectColList].apply(lambda x: stats.skew(x.dropna())).sort_values(ascending = False)
    skewM=pd.DataFrame({'skew':skewM})
    return skewM[abs(skewM)>0.5].dropna()

skewM = SkewMeasure(all_data)
skewM

The table shows that many columns are highly skewed, so we apply a Box-Cox transformation with λ = -1 to reduce the skewness.

In [None]:
for i in skewM.index:
    all_data[i] = special.boxcox1p(all_data[i],-1)
    
SkewMeasure(all_data)
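
For reference, `boxcox1p(x, λ)` computes ((1 + x)^λ - 1) / λ for λ ≠ 0 and log(1 + x) for λ = 0. With λ = -1 this simplifies to 1 - 1/(1 + x), which maps non-negative inputs into [0, 1) and compresses large values strongly. The sketch below re-implements the formula with NumPy only to make that visible; `boxcox1p_manual` is a hypothetical name, and the notebook itself uses `scipy.special.boxcox1p`.

```python
import numpy as np

def boxcox1p_manual(x, lmbda):
    """Box-Cox transform of 1 + x, written out from its definition."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lmbda - 1.0) / lmbda

# Toy inputs spanning several orders of magnitude.
x = np.array([0.0, 1.0, 9.0, 999.0])
y = boxcox1p_manual(x, -1)

print(y)
# For lambda = -1 the result equals 1 - 1 / (1 + x).
print(np.allclose(y, 1.0 - 1.0 / (1.0 + x)))
```

Because the output saturates near 1, very large inputs become hard to tell apart after the transformation, which is worth keeping in mind when interpreting the transformed features.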

After the Box-Cox transformation the skewness is lower. Plot the distribution of each column again.

In [None]:
columnList = list(all_data.columns)
columnList

fig = plt.figure(figsize=[20,20])
for col,i in zip(columnList,range(1,19)):
    axes = fig.add_subplot(6,3,i)
    sns.distplot(all_data[col],ax=axes, kde_kws={'bw':1.5}, color='purple')
plt.show()

# 4. Model Training

Here we use mljar-supervised, a mature AutoML framework for training machine learning models.

In [None]:
!pip install -q -U git+https://github.com/mljar/mljar-supervised.git@master

In [None]:
from supervised.automl import AutoML

In [None]:
trainDF = all_data[:len(train_data)]
testDF = all_data[len(train_data):]
print(trainDF.shape)
print(testDF.shape)
trainDF

In [None]:
automl = AutoML(
    mode="Compete", 
    eval_metric="auc",
    total_time_limit=3000,
    features_selection=False # switch off feature selection
)

In [None]:
x_cols = trainDF.columns[:].tolist()
x_cols

In [None]:
automl.fit(trainDF[x_cols], labels)

In [None]:
preds = automl.predict_proba(testDF[x_cols])

In [None]:
automl.report()
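
The predicted probabilities still have to be paired with the customer IDs saved earlier in `testID`. The following is a minimal sketch, assuming `preds` is a two-column array whose second column is the probability of class 1, and assuming the output columns are named `Id` and `Probability`; both assumptions should be checked against the actual output and the competition's sample submission. The helper `build_submission` and the toy inputs are hypothetical.

```python
import numpy as np
import pandas as pd

def build_submission(ids, proba) -> pd.DataFrame:
    """Pair each ID with the predicted probability of the positive class."""
    proba = np.asarray(proba)
    # Assumption: column 1 holds P(label == 1) when classes are ordered [0, 1].
    return pd.DataFrame({"Id": np.asarray(ids), "Probability": proba[:, 1]})

# Toy stand-ins for testID and preds.
toy_ids = pd.Series([1, 2, 3])
toy_proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.75, 0.25]])

submission = build_submission(toy_ids, toy_proba)
print(submission)
# submission.to_csv("submission.csv", index=False)  # uncomment to write the file
```

With the notebook's variables the call would be `build_submission(testID, preds)`, provided no test rows were dropped and the row order is unchanged.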