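In everyday life, borrowing money to cover short-term cash needs is common. Mortgages for buying a home, car loans for buying a car, and the currently very popular P2P lending are all lending transactions. So which applicants do financial institutions lend to, and what logic do they follow when granting loans? This article analyzes what kind of applicant is more likely to have a loan approved.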

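The data used here again comes from an Analytics Vidhya competition. It is home loan application data collected by a bank, and the goal of this analysis is to uncover patterns in how the bank approves home loans. After downloading the data locally, first import the libraries and read in the data.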

In [63]:
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('data/loan.csv')
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [64]:
df.shape

(614, 13)

The dataset contains 614 records with 13 fields. Loan_Status records whether the loan application was approved. The fields are:
<table align="left">
    <tr><td>Loan_ID</td><td>Unique loan application ID</td></tr>
    <tr><td>Gender</td><td>Gender</td></tr>
    <tr><td>Married</td><td>Whether the applicant is married</td></tr>
    <tr><td>Dependents</td><td>Number of dependents</td></tr>
    <tr><td>Education</td><td>Applicant's education level</td></tr>
    <tr><td>Self_Employed</td><td>Whether the applicant is self-employed</td></tr>
    <tr><td>ApplicantIncome</td><td>Applicant's income</td></tr>
    <tr><td>CoapplicantIncome</td><td>Co-applicant's income</td></tr>
    <tr><td>LoanAmount</td><td>Loan amount requested</td></tr>
    <tr><td>Loan_Amount_Term</td><td>Repayment term</td></tr>
    <tr><td>Credit_History</td><td>Whether past credit history meets the requirements</td></tr>
    <tr><td>Property_Area</td><td>Area where the property is located</td></tr>
    <tr><td>Loan_Status</td><td>Whether the application was approved</td></tr>
</table>

Next, check whether the data contains missing values and decide how to handle them.

In [65]:
df.isnull().any()

Loan_ID              False
Gender                True
Married               True
Dependents            True
Education            False
Self_Employed         True
ApplicantIncome      False
CoapplicantIncome    False
LoanAmount            True
Loan_Amount_Term      True
Credit_History        True
Property_Area        False
Loan_Status          False
dtype: bool
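
The boolean summary only says whether a column has gaps, not how many. As a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the competition data), `isnull().sum()` counts the missing entries per column, and comparing the row count before and after `dropna()` shows how many rows would be lost:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up for illustration).
toy = pd.DataFrame({
    'Gender': ['Male', None, 'Female', 'Male'],
    'LoanAmount': [np.nan, 128.0, 66.0, 120.0],
    'Loan_Status': ['Y', 'N', 'Y', 'Y'],
})

missing_counts = toy.isnull().sum()   # number of missing values per column
rows_left = len(toy.dropna())         # rows with no missing value at all

print(missing_counts)
print(rows_left, 'of', len(toy), 'rows have no missing values')
```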

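The results show that most fields contain missing values. First drop the records that contain missing items and see how many remain, then decide whether the missing values need to be filled in.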

In [66]:
new_df = df.dropna()
new_df.shape

(480, 13)

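After dropping the records with missing items, 480 remain. That is enough for the present analysis, so the rest of the analysis uses these 480 records.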
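
Had too few records survived, filling the gaps would have been the alternative. The sketch below shows one common approach on a made-up toy frame: the mode for a categorical column and the median for a numeric one. Which fill strategy is appropriate is an assumption here; this analysis itself does not fill anything.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (made-up values, not the competition data).
toy = pd.DataFrame({
    'Gender': ['Male', None, 'Male', 'Female'],
    'LoanAmount': [np.nan, 128.0, 66.0, 120.0],
})

filled = toy.copy()
# Categorical column: fill with the most frequent value (the mode).
filled['Gender'] = filled['Gender'].fillna(filled['Gender'].mode()[0])
# Numeric column: fill with the median, which is less sensitive to outliers than the mean.
filled['LoanAmount'] = filled['LoanAmount'].fillna(filled['LoanAmount'].median())

print(filled)
```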

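First, a quick guess at who is more likely to get a loan. Intuitively, applicants with higher income, more education, and fewer dependents should find it easier. A good credit history, a property in a good location, and a smaller loan amount should also help. The analysis below checks whether this intuition holds.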

The first thing to examine is how education level affects the approval outcome.

In [69]:
data = new_df.groupby(['Loan_Status', 'Education',]).size()
# print(data)
loan_status_list = ['Y', 'N']
education_status = ['Graduate', 'Not Graduate']
show_datas = []
for status in loan_status_list:
    tmp = data[status]
    values = [tmp[education] for education in education_status]
    total = sum(values)
    ratios = [round(value / total, 2) for value in values]
    show_datas.append((status, education_status, ratios))
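
The ratios computed in this cell are the shares of each education level within the approved and within the rejected applications. A complementary view is the approval rate within each education group. A minimal sketch on made-up toy data, grouping a boolean "approved" flag by `Education`:

```python
import pandas as pd

# Toy data (made up for illustration); column names follow the loan dataset.
toy = pd.DataFrame({
    'Education': ['Graduate', 'Graduate', 'Graduate', 'Graduate',
                  'Not Graduate', 'Not Graduate'],
    'Loan_Status': ['Y', 'Y', 'Y', 'N', 'Y', 'N'],
})

approved = toy['Loan_Status'] == 'Y'                        # True where the loan was approved
approval_rate = approved.groupby(toy['Education']).mean()   # mean of booleans = share approved

print(approval_rate)
```

Both views use the same counts; they differ only in which total each count is divided by.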

In [70]:
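# Note: this import style is the pyecharts 0.5.x API; from 1.0 on, charts are imported from pyecharts.charts.
from pyecharts import Line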
line = Line('Education')
i = 0
for name, attrs, values in show_datas:
    line.add(name, attrs, values, is_fill=True, area_opacity=0.4 + 0.1 * i, is_smooth=True,)
    i += 1
line

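The chart clearly shows that education level has a positive effect on loan approval. Next, look at how income affects approval. Because income is numeric, computing a ratio for each distinct value would be meaningless, so the values are binned into groups: 3000 or less (Level 1), above 3000 and below 5000 (Level 2), 5000 up to but not including 10000 (Level 3), and 10000 or more (Level 4).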

In [71]:
def grade_income(value):
    if value <= 3000:
        return 'Level 1'
    elif value < 5000:
        return 'Level 2'
    elif value < 10000:
        return 'Level 3'
    else:
        return 'Level 4'

# Take an explicit copy first so the new column is not written to a slice of df
# (this avoids pandas' SettingWithCopyWarning).
new_df = new_df.copy()
new_df['IncomeLevel'] = new_df['ApplicantIncome'].apply(grade_income)
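
The same binning can also be written with `pd.cut`. The sketch below assumes incomes are whole numbers (as `ApplicantIncome` appears to be in the sample rows), so that the bin edges 4999 and 9999 reproduce the `< 5000` and `< 10000` tests of `grade_income`. The sample incomes are made up.

```python
import numpy as np
import pandas as pd

def grade_income(value):
    # Same thresholds as the function in the notebook cell above.
    if value <= 3000:
        return 'Level 1'
    elif value < 5000:
        return 'Level 2'
    elif value < 10000:
        return 'Level 3'
    else:
        return 'Level 4'

# Made-up incomes chosen to sit on and around the bin boundaries.
incomes = pd.Series([0, 3000, 3001, 4999, 5000, 9999, 10000, 81000])

# Right-closed bins: (-inf, 3000], (3000, 4999], (4999, 9999], (9999, inf)
bins = [-np.inf, 3000, 4999, 9999, np.inf]
labels = ['Level 1', 'Level 2', 'Level 3', 'Level 4']
levels = pd.cut(incomes, bins=bins, labels=labels)

# Check that both approaches agree on these integer incomes.
same = levels.astype(str).tolist() == incomes.apply(grade_income).tolist()
print(levels.tolist())
print(same)
```

With non-integer incomes the edges would have to change, which is one reason to keep the explicit function when the boundaries mix `<=` and `<`.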


With the data prepared, look at the effect of income on approval in the same way.

In [72]:
data = new_df.groupby(['Loan_Status', 'IncomeLevel',]).size()
# print(data)
loan_status_list = ['Y', 'N']
income_level = ['Level 1', 'Level 2', 'Level 3', 'Level 4']
show_datas = []
for status in loan_status_list:
    tmp = data[status]
    values = [tmp[level] for level in income_level]
    total = sum(values)
    ratios = [round(value / total, 2) for value in values]
    show_datas.append((status, income_level, ratios))
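
This loop repeats the one used for Education. As a hedged alternative, the same within-status shares can be produced by a small helper built on `pd.crosstab` with `normalize='index'`. The helper name `status_shares` and the toy data are hypothetical, introduced only for this sketch.

```python
import pandas as pd

def status_shares(frame, column):
    """Share of each `column` category within approved (Y) and rejected (N) applications."""
    # normalize='index' divides every row (one loan status) by its own total.
    table = pd.crosstab(frame['Loan_Status'], frame[column], normalize='index')
    return table.round(2)

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'IncomeLevel': ['Level 1', 'Level 2', 'Level 2', 'Level 3', 'Level 1', 'Level 4'],
    'Loan_Status': ['Y', 'Y', 'Y', 'Y', 'N', 'N'],
})

shares = status_shares(toy, 'IncomeLevel')
print(shares)
```

A side benefit is that a status/category combination with no records shows up as 0 instead of raising a `KeyError`, as the indexing in the loop would.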

In [73]:
line = Line('Income Level')
i = 0
for name, attrs, values in show_datas:
    line.add(name, attrs, values, is_fill=True, area_opacity=0.4 + 0.1 * i, is_smooth=True,)
    i += 1
line

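Applicants in Level 2 and Level 3 have the highest chance of approval, while approval actually drops for the high-income Level 4 group, so the cause is worth digging into further. First extract the applicants whose income is classified as Level 4 and see what characterizes them.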

In [74]:
data = new_df[new_df['IncomeLevel'] == 'Level 4']
data.shape

(44, 14)

There are 44 such records in total; these 44 records are analyzed in more depth.

In [75]:
values = data.describe()
values = values['LoanAmount']
v1 = values.values[3:]

In [76]:
data = new_df[new_df['IncomeLevel'] != 'Level 4']
values = data.describe()
values = values['LoanAmount']
v2 = values.values[3:]
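
The slice `[3:]` in the two cells above relies on the row order of `describe()` (count, mean, std, min, 25%, 50%, 75%, max), so it keeps the five values a box plot needs. Selecting those rows by label makes the intent explicit. A minimal sketch on made-up loan amounts:

```python
import pandas as pd

# Made-up loan amounts, for illustration only.
amounts = pd.Series([100.0, 120.0, 140.0, 160.0, 200.0], name='LoanAmount')

summary = amounts.describe()
# Five-number summary selected by label instead of by position.
five_numbers = summary.loc[['min', '25%', '50%', '75%', 'max']]

print(five_numbers)
```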

In [77]:
from pyecharts import Boxplot
boxplot = Boxplot('Loan Amount')
boxplot.add('Income level', ['Level4', 'Level1-3'], [v1, v2])
boxplot

The comparison clearly shows that high-income Level 4 applicants tend to request larger loans, which may be why their approval rate is lower.

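Loan amount is also an important factor in approval. Next, look at how the loan amounts of approved applications compare with those of rejected ones.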

In [100]:
attrs = []
values = []
mean_values = []
# For every income level and approval status, collect the box-plot statistics
# and the mean of the requested loan amount.
for n, level in enumerate(['Level 1', 'Level 2', 'Level 3', 'Level 4'], start=1):
    for status in ['Y', 'N']:
        data = new_df[(new_df['Loan_Status'] == status) & (new_df['IncomeLevel'] == level)]['LoanAmount']
        stats = data.describe().values
        values.append(stats[3:])      # min, 25%, 50%, 75%, max
        mean_values.append(stats[1])  # mean
        attrs.append('Level{}_{}'.format(n, status))

In [101]:
from pyecharts import Overlap
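overlap = Overlap("Loan amount vs. income level")
boxplot = Boxplot('')
boxplot.add('Income level and approval status', attrs, values)
overlap.add(boxplot)

line = Line('')
line.add('Mean loan amount', attrs, mean_values)
overlap.add(line)
# Put the chart object on the last line so the notebook renders it.
overlap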

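The chart shows that at every income level, the average loan amount of approved applications is lower than that of rejected ones, and this is especially pronounced for Level 3 and Level 4. This suggests that the bank is quite cautious when approving large loans.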
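
The comparison of means can also be read from a small table rather than from the chart. The sketch below uses made-up toy numbers and groups by `IncomeLevel` and `Loan_Status`; the extra column `N_minus_Y` is a hypothetical name for the gap between rejected and approved means.

```python
import pandas as pd

# Toy data (made up for illustration); column names follow the loan dataset.
toy = pd.DataFrame({
    'IncomeLevel': ['Level 3', 'Level 3', 'Level 3', 'Level 4', 'Level 4', 'Level 4'],
    'Loan_Status': ['Y', 'Y', 'N', 'Y', 'N', 'N'],
    'LoanAmount': [120.0, 140.0, 200.0, 250.0, 400.0, 500.0],
})

# Mean requested amount per income level and approval status,
# with one column per status after unstack().
mean_amount = (
    toy.groupby(['IncomeLevel', 'Loan_Status'])['LoanAmount']
       .mean()
       .unstack('Loan_Status')
)
# Positive values mean rejected applications asked for more on average.
mean_amount['N_minus_Y'] = mean_amount['N'] - mean_amount['Y']

print(mean_amount)
```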

How credit history and property location affect the approval outcome is left for the reader to verify.