# 使用决策树进行个人信用风险评估

使用sklearn中的DecisionTreeClassifier算法来对德国贷款数据建立决策树模型

# 目录
[1 数据源](#1)<br>
[2 数据探索和预处理](#2)<br>
[3 划分训练集和测试集](#3)<br>
[4 模型训练](#4)<br>
[5 模型性能评估](#5)<br>
[6 模型性能提升](#6)<br>

<div id="1"></div>

# 1 数据源
使用UCI上的德国信用数据集。该数据集包含了1000个贷款信息，每一个贷款有20个自变量和一个类变量记录该笔贷款是否违约。
我们将使用该数据集构建模型来预测贷款是否违约。

<div id="2"></div>

# 2 数据探索和预处理

In [231]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

credit = pd.read_csv("data/german_credit.csv")
credit.head(2)

Unnamed: 0,default,account_check_status,duration_in_month,credit_history,purpose,credit_amount,savings,present_emp_since,installment_as_income_perc,personal_status_sex,...,present_res_since,property,age,other_installment_plans,housing,credits_this_bank,job,people_under_maintenance,telephone,foreign_worker
0,0,< 0 DM,6,critical account/ other credits existing (not ...,domestic appliances,1169,unknown/ no savings account,.. >= 7 years,4,male : single,...,4,real estate,67,none,own,2,skilled employee / official,1,"yes, registered under the customers name",yes
1,1,0 <= ... < 200 DM,48,existing credits paid back duly till now,domestic appliances,5951,... < 100 DM,1 <= ... < 4 years,2,female : divorced/separated/married,...,2,real estate,22,none,own,1,skilled employee / official,1,none,yes


In [232]:
credit.shape

(1000, 21)

该数据集包含1000个样本和21个变量。其中default表示信用好坏，其余为特征变量

使用value_counts()函数对支票余额变量account_check_status和储蓄账户余额变量savings_balance进行查看。

In [233]:
credit.job.value_counts()

skilled employee / official                                      630
unskilled - resident                                             200
management/ self-employed/ highly qualified employee/ officer    148
unemployed/ unskilled - non-resident                              22
Name: job, dtype: int64

In [234]:
credit.savings.value_counts()

... < 100 DM                   603
unknown/ no savings account    183
100 <= ... < 500 DM            103
500 <= ... < 1000 DM            63
.. >= 1000 DM                   48
Name: savings, dtype: int64

上述两个变量的单位都是德国马克（Deutsche Mark, DM）。 直观来看，支票余额和储蓄账户余额越大，贷款违约的可能性越小。

该贷款数据集还有一些数值型变量，例如贷款期限（duration_in_month）和贷款申请额度（credit_amount）。

##### 我们需要把字符变量转换为数字

查看哪些列是字符串类型的

In [235]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
default                       1000 non-null int64
account_check_status          1000 non-null object
duration_in_month             1000 non-null int64
credit_history                1000 non-null object
purpose                       1000 non-null object
credit_amount                 1000 non-null int64
savings                       1000 non-null object
present_emp_since             1000 non-null object
installment_as_income_perc    1000 non-null int64
personal_status_sex           1000 non-null object
other_debtors                 1000 non-null object
present_res_since             1000 non-null int64
property                      1000 non-null object
age                           1000 non-null int64
other_installment_plans       1000 non-null object
housing                       1000 non-null object
credits_this_bank             1000 non-null int64
job                           1000

查看可取哪些值

In [247]:
credit.account_check_status.value_counts()

0    394
1    274
2    269
3     63
Name: account_check_status, dtype: int64

In [237]:
#取字符串的数组
cols = ['account_check_status','credit_history', 'purpose', 'savings', 'present_emp_since','personal_status_sex', 
        'other_debtors','property','other_installment_plans','housing','job','telephone','foreign_worker']
#映射字典
col_dicts = {}
col_dicts = {
  'account_check_status': {
    '0 <= ... < 200 DM': 2,
    '< 0 DM': 1,
    '>= 200 DM / salary assignments for at least 1 year': 3,
    'no checking account': 0
  },
             
  'credit_history': {
      'existing credits paid back duly till now': 0,
      'critical account/ other credits existing (not at this bank)': 1,
      'delay in paying off in the past': 2,
      'all credits at this bank paid back duly': 3,
      'no credits taken/ all credits paid back duly': 4
  },
           
  'purpose': {
      'domestic appliances': 0,
      'car (new)': 1,
      'radio/television': 2,
      'car (used)': 3,
      'business': 4,
      '(vacation - does not exist?)': 5,
      'education': 6,
      'furniture/equipment': 7,
      'repairs': 8,
      'retraining': 9
  },
                
 'savings': {'... < 100 DM': 0,
  'unknown/ no savings account': 1,
  '100 <= ... < 500 DM': 2,
  '500 <= ... < 1000 DM': 3,
  '.. >= 1000 DM': 4},
  
 'present_emp_since':{
     '1 <= ... < 4 years':0,
     '.. >= 7 years':1,
     '4 <= ... < 7 years':2,
     '... < 1 year':3,
     'unemployed':4
 },
 
'other_installment_plans':{
    'none':0,
    'bank':1,
    'stores':2
},
             
 'foreign_worker': {'no': 1, 'yes': 0},
             
 'housing': {'for free': 1, 'own': 0, 'rent': 2},
                               
 'job': {'skilled employee / official': 0,
  'unskilled - resident': 1,
  'management/ self-employed/ highly qualified employee/ officer': 2,
  'unemployed/ unskilled - non-resident': 3},
             
 'other_debtors': {'co-applicant': 2, 'guarantor': 1, 'none': 0},
             
 'personal_status_sex': {'male : single': 0,
  'female : divorced/separated/married': 1,
  'male : married/widowed': 2,
  'male : divorced/separated': 3},
           
 'property': {'if not A121/A122 : car or other, not in attribute 6': 0,
  'real estate': 1,
  'if not A121 : building society savings agreement/ life insurance': 2,
  'unknown / no property': 3},
             
 'telephone': {'none': 0, 'yes, registered under the customers name': 1}}

for col in cols:
    credit[col] = credit[col].map(lambda x: x.strip())
    credit[col] = credit[col].map(col_dicts[col])
    
credit.head(5)

Unnamed: 0,default,account_check_status,duration_in_month,credit_history,purpose,credit_amount,savings,present_emp_since,installment_as_income_perc,personal_status_sex,...,present_res_since,property,age,other_installment_plans,housing,credits_this_bank,job,people_under_maintenance,telephone,foreign_worker
0,0,1,6,1,0,1169,1,1,4,0,...,4,1,67,0,0,2,0,1,1,0
1,1,2,48,0,0,5951,0,0,2,1,...,2,1,22,0,0,1,0,1,0,0
2,0,0,12,1,5,2096,0,2,2,0,...,3,1,49,0,0,1,1,2,0,0
3,0,1,42,0,2,7882,0,2,2,0,...,4,2,45,0,1,1,0,2,0,0
4,1,1,24,2,1,4870,0,0,3,0,...,4,3,53,0,1,2,0,2,0,0


In [238]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
default                       1000 non-null int64
account_check_status          1000 non-null int64
duration_in_month             1000 non-null int64
credit_history                1000 non-null int64
purpose                       1000 non-null int64
credit_amount                 1000 non-null int64
savings                       1000 non-null int64
present_emp_since             1000 non-null int64
installment_as_income_perc    1000 non-null int64
personal_status_sex           1000 non-null int64
other_debtors                 1000 non-null int64
present_res_since             1000 non-null int64
property                      1000 non-null int64
age                           1000 non-null int64
other_installment_plans       1000 non-null int64
housing                       1000 non-null int64
credits_this_bank             1000 non-null int64
job                           1000 non-null 

<div id="3"></div>

# 3 划分训练集和测试集
在正式建模之前，我们需要将数据集分为训练集和测试集两部分。其中训练集用来构建决策树模型，测试集用来评估模型性能。
我们将使用70%数据作为训练数据，30%作为训练数据。

In [239]:
from sklearn import model_selection

y = credit['default']
X  = credit.loc[:,'account_check_status':'foreign_worker']
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=1)

我们验证一下训练集和测试集中，违约贷款的比例是否接近。

In [240]:
print (y_train.value_counts()/len(y_train))
print (y_test.value_counts()/len(y_test))

0    0.694286
1    0.305714
Name: default, dtype: float64
0    0.713333
1    0.286667
Name: default, dtype: float64


可见，训练集和测试集中违约贷款比例均接近30%。

<div id="4"></div>

# 4 模型训练
我们将使用Scikit-learn中的DecisionTreeClassifier算法来训练决策树模型。
DecisionTreeClassifier算法位于sklearn.tree包，首先将其导入，然后调用`fit()`方法进行模型训练。

In [241]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
credit_model = DecisionTreeClassifier(min_samples_leaf = 6,random_state=1)
credit_model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=6, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1, splitter='best')

<div id="5"></div>

# 5 模型性能评估
为了将我们训练好的决策树模型应用于测试数据，我们使用`predict()`函数，代码如下：

In [242]:
credit_pred = credit_model.predict(X_test)

现在，我们得到了决策树模型在测试数据上的预测结果，通过将预测结果和真实结果进行对比可以评估模型性能。
可以使用`sklearn.metrics`包中的`classification_report()`和`confusion_matrix()`函数，展示模型分类结果：

In [243]:
from sklearn import metrics
print (metrics.classification_report(y_test, credit_pred))
print (metrics.confusion_matrix(y_test, credit_pred_cost))

              precision    recall  f1-score   support

           0       0.77      0.83      0.80       214
           1       0.48      0.40      0.43        86

    accuracy                           0.70       300
   macro avg       0.63      0.61      0.62       300
weighted avg       0.69      0.70      0.69       300

[[116  98]
 [ 13  73]]


在300个贷款申请测试数据中，模型的预测正确率（Accuracy）为71.7%。 214个未违约贷款中，模型正确预测了80%。 86个违约贷款中，模型正确预测出了43%。 下面，我们看看是否能够进一步改善模型的性能。

<div id="6"></div>

# 6 模型性能提升
在实际应用中，模型的预测正确率不高，很难将其应用到实时的信贷评审过程。
在本案例中，如果一个模型将所有的贷款都预测为“未违约”，此时模型的正确率将为72%，而该模型是一个完全无用的模型。
上节中我们建立的模型，正确率为70.7%，但是对于违约贷款的识别性能很差。
我们可以通过创建一个代价矩阵定义模型犯不同错误时的代价。
假设我们认为一个贷款违约者给银行带来的损失是银行错过一个不违约的贷款带来损失的4倍，则未违约和违约的代价权重可以定义为：

In [246]:
class_weights = {0:1, 1:4}
credit_model_cost = DecisionTreeClassifier(max_depth=6,class_weight = class_weights)
credit_model_cost.fit(X_train, y_train)
credit_pred_cost = credit_model_cost.predict(X_test)

print (metrics.classification_report(y_test, credit_pred_cost))
print (metrics.confusion_matrix(y_test, credit_pred_cost))


              precision    recall  f1-score   support

           0       0.90      0.53      0.67       214
           1       0.43      0.86      0.57        86

    accuracy                           0.63       300
   macro avg       0.67      0.70      0.62       300
weighted avg       0.77      0.63      0.64       300

[[114 100]
 [ 12  74]]


可见，模型的整体正确率下降为63%，但是此时的模型能将86个违约贷款中的74个正确识别，识别率为86%。