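Loan approval prediction involves analyzing factors such as an applicant's financial history, income, credit rating, employment status, and other relevant attributes. By applying machine learning algorithms to historical loan data, a business can build a model that predicts whether a new applicant's loan will be approved.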

## Importing the Python libraries and the dataset

In [115]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv('loan_prediction.csv')
print(df.head())

    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural           N  
2             1.0   

Drop the loan ID column, which is only an identifier and carries no predictive information:

In [116]:
df = df.drop('Loan_ID',axis=1)
print(df.head())

  Gender Married Dependents     Education Self_Employed  ApplicantIncome  \
0   Male      No          0      Graduate            No             5849   
1   Male     Yes          1      Graduate            No             4583   
2   Male     Yes          0      Graduate           Yes             3000   
3   Male     Yes          0  Not Graduate            No             2583   
4   Male      No          0      Graduate            No             6000   

   CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  \
0                0.0         NaN             360.0             1.0   
1             1508.0       128.0             360.0             1.0   
2                0.0        66.0             360.0             1.0   
3             2358.0       120.0             360.0             1.0   
4                0.0       141.0             360.0             1.0   

  Property_Area Loan_Status  
0         Urban           Y  
1         Rural           N  
2         Urban           Y  
3 

In [117]:
# Check for missing values
df.isnull().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

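Some categorical columns and some numeric columns have missing values. Before filling them in, let's look at the descriptive statistics of the dataset: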

In [118]:
print(df.describe())

       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
count       614.000000         614.000000  592.000000         600.00000   
mean       5403.459283        1621.245798  146.412162         342.00000   
std        6109.041673        2926.248369   85.587325          65.12041   
min         150.000000           0.000000    9.000000          12.00000   
25%        2877.500000           0.000000  100.000000         360.00000   
50%        3812.500000        1188.500000  128.000000         360.00000   
75%        5795.000000        2297.250000  168.000000         360.00000   
max       81000.000000       41667.000000  700.000000         480.00000   

       Credit_History  
count      564.000000  
mean         0.842199  
std          0.364878  
min          0.000000  
25%          1.000000  
50%          1.000000  
75%          1.000000  
max          1.000000  


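Now fill in the missing values. For the categorical columns, we can use each column's mode. The mode is the most frequent value in a column, which makes it a reasonable choice for categorical data: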

In [119]:
# Fill missing values in the categorical columns with each column's mode.
# Assigning the result back avoids the chained-assignment FutureWarning
# that df[col].fillna(..., inplace=True) triggers in recent pandas versions.
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
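As a side note, the same mode-based fill can be written as a single `fillna` call on the DataFrame with a `{column: value}` dictionary, which is the form the pandas warning itself suggests. The sketch below uses a small made-up frame (`toy` is not the loan dataset) purely to illustrate the pattern.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in two categorical columns
toy = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "Married": ["Yes", np.nan, "Yes", "No"],
})

# Build {column: most frequent value}; mode() returns a Series, so take its first entry
fill_values = {col: toy[col].mode()[0] for col in ["Gender", "Married"]}

# One fillna call on the DataFrame itself, so no chained assignment is involved
toy = toy.fillna(fill_values)
print(toy)
```

On the real data, the list of column names would be the four categorical columns filled above.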

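To fill the missing values in the numeric columns, we need to pick an appropriate statistic for each one:

- LoanAmount: use the median. The median is a suitable choice when the distribution is skewed or the data contains outliers.
- Loan_Amount_Term: use the mode. The loan term takes a small set of discrete values, so the most frequent value is a sensible fill.
- Credit_History: use the mode. Credit history is a binary variable (0 or 1), so the most common value is a reasonable choice.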
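To see why the median is the safer choice for a skewed column, compare it with the mean on a few made-up loan amounts containing one extreme value. In the real data the same pattern is visible in the `describe()` output above, where the LoanAmount mean (146.41) sits above the median (128.0).

```python
import pandas as pd

# Made-up loan amounts with one extreme value
amounts = pd.Series([100.0, 120.0, 128.0, 141.0, 700.0])

mean_value = amounts.mean()      # pulled upward by the 700 outlier
median_value = amounts.median()  # middle value, unaffected by how large the outlier is

print(mean_value, median_value)
```

Filling gaps with the mean would inject a value that few real applicants resemble, while the median stays close to a typical loan.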

In [120]:
# Fill missing LoanAmount values with the median
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())
# Fill missing Loan_Amount_Term values with the mode
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])
# Fill missing Credit_History values with the mode
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])

In [121]:
# Check again for missing values
df.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [122]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             614 non-null    object 
 1   Married            614 non-null    object 
 2   Dependents         614 non-null    object 
 3   Education          614 non-null    object 
 4   Self_Employed      614 non-null    object 
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         614 non-null    float64
 8   Loan_Amount_Term   614 non-null    float64
 9   Credit_History     614 non-null    float64
 10  Property_Area      614 non-null    object 
 11  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 62.4+ KB


## 探索性数据分析

First, let's look at the distribution of the loan status column:

In [123]:
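# Import Plotly Express, a high-level library for interactive charts
import plotly.express as px
loan_status_count = df['Loan_Status'].value_counts()
print(loan_status_count)
# Create a pie chart; values= is passed explicitly so slice sizes are tied to the counts
fig_loan_status = px.pie(loan_status_count,
                         names=loan_status_count.index,
                         values=loan_status_count.values,
                         title='Loan Status Distribution')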

fig_loan_status.show()



Loan_Status
Y    422
N    192
Name: count, dtype: int64
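The counts above show that the classes are imbalanced. A quick way to quantify this is to turn the counts into proportions; the largest proportion is also the accuracy a trivial model would reach by always predicting the majority class. The sketch below re-enters the two counts by hand (they refer to the 614 rows before any outlier removal).

```python
import pandas as pd

# Counts copied from the output above
loan_status_count = pd.Series({"Y": 422, "N": 192}, name="count")

# Share of each class
proportions = loan_status_count / loan_status_count.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = proportions.max()
print(round(baseline_accuracy, 3))
```

Any trained model should be judged against this baseline rather than against 50%.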


Next, the distribution of the gender column:

In [124]:
import plotly.express as px
gender_count = df['Gender'].value_counts()
print(gender_count)
fig_gender = px.bar(gender_count,
                    x=gender_count.index,
                    y=gender_count.values,
                    title='Gender Distribution')
fig_gender.show()



Gender
Male      502
Female    112
Name: count, dtype: int64


Next, the distribution of the marital status column:

In [125]:
import plotly.express as px
married_count = df['Married'].value_counts()
fig_married = px.bar(married_count,
                     x=married_count.index,
                     y=married_count.values,
                     title='Marital Status Distribution')
fig_married.show()

Next, the distribution of the education column:

In [126]:
import plotly.express as px
education_count = df['Education'].value_counts()
fig_education = px.bar(education_count,
                       x=education_count.index,
                       y=education_count.values,
                       title='Education Distribution')
fig_education.show()

Next, the distribution of the self-employment column:

In [127]:
import plotly.express as px
self_employed_count = df['Self_Employed'].value_counts()
fig_self_employed = px.bar(self_employed_count,
                           x=self_employed_count.index,
                           y=self_employed_count.values,
                           title='Self-Employment Distribution')
fig_self_employed.show()



Next, the distribution of the applicant income column:

In [128]:
import plotly.express as px
fig_applicant_income = px.histogram(df, x='ApplicantIncome', title='Applicant Income Distribution')
fig_applicant_income.show()

Now let's look at the relationship between applicant income and loan status:

In [129]:
import plotly.express as px
fig_income = px.box(df, x='Loan_Status',
                    y='ApplicantIncome',
                    color="Loan_Status",
                    title='Loan Status vs Applicant Income')
fig_income.show()

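The ApplicantIncome column contains outliers, which should be removed before moving on. One way to remove them is the IQR (interquartile range) method:

In [130]:
# 1. Compute the first quartile (Q1)
Q1 = df['ApplicantIncome'].quantile(0.25)
print(Q1) # 2877.5
# 2. Compute the third quartile (Q3)
Q3 = df['ApplicantIncome'].quantile(0.75)
print(Q3) # 5795.0
# 3. Compute the interquartile range (IQR)
IQR = Q3 - Q1
print(IQR) # 2917.5
# 4. Define the lower bound for outliers
lower_bound = Q1 - 1.5 * IQR
print(lower_bound) # -1498.75
# 5. Define the upper bound for outliers
upper_bound = Q3 + 1.5 * IQR
print(upper_bound) # 10171.25
# 6. Keep only the rows inside [lower_bound, upper_bound]
df = df[(df['ApplicantIncome'] >= lower_bound) & (df['ApplicantIncome'] <= upper_bound)]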




2877.5
5795.0
2917.5
-1498.75
10171.25
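Because the same IQR filter is applied to more than one column in this notebook, it can be convenient to wrap it in a small helper. The function below is a sketch, not part of the original notebook: `remove_iqr_outliers` and its `k` parameter are hypothetical names, and the demo frame holds made-up incomes.

```python
import pandas as pd

def remove_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends, matching the >= / <= filter above
    return frame[frame[column].between(lower, upper)]

# Made-up incomes with one extreme value
toy = pd.DataFrame({"ApplicantIncome": [2500, 3000, 3800, 4600, 6000, 81000]})
cleaned = remove_iqr_outliers(toy, "ApplicantIncome")
print(cleaned)
```

With the helper in place, each column can be filtered with one call, for example `df = remove_iqr_outliers(df, 'ApplicantIncome')`.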


Now let's look at the relationship between coapplicant income and loan status:

In [131]:
import plotly.express as px
fig_coapplicant_income = px.box(df,
                                x='Loan_Status',
                                y='CoapplicantIncome',
                                color="Loan_Status",
                                title='Loan Status vs Coapplicant Income')
fig_coapplicant_income.show()

In [132]:
# Coapplicant income also contains outliers
# Compute the IQR
Q1 = df['CoapplicantIncome'].quantile(0.25)
Q3 = df['CoapplicantIncome'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove the outliers
df = df[(df['CoapplicantIncome'] >= lower_bound) & (df['CoapplicantIncome'] <= upper_bound)]

Now let's look at the relationship between loan amount and loan status:

In [133]:
fig_loan_amount = px.box(df, x='Loan_Status',
                         y='LoanAmount',
                         color="Loan_Status",
                         title='Loan Status vs Loan Amount')
fig_loan_amount.show()

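Now let's look at the relationship between credit history and loan status:

In [134]:
import plotly.express as px
# Create a grouped histogram
fig_credit_history = px.histogram(df, x='Credit_History', color='Loan_Status',
                                  barmode='group',
                                  title='Loan Status vs Credit History')
# The x-axis shows credit history (0 or 1; typically 0 = poor credit, 1 = good credit)
# Show the chart
fig_credit_history.show()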
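The grouped histogram can be complemented with a numeric view. One option is `pd.crosstab` with `normalize="index"`, which gives the share of approved and rejected loans within each credit-history group. The rows below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows, not the real dataset
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N", "Y", "Y"],
})

# normalize="index" turns each row into proportions that sum to 1
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(rates)
```

Running the same call on `df['Credit_History']` and `df['Loan_Status']` would give the actual approval rates per group.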



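Now let's look at the relationship between property area (the type of region the property is in) and loan status:

In [135]:
import plotly.express as px
# Create a grouped histogram
fig_property_area = px.histogram(df, x='Property_Area', color='Loan_Status',
                                 barmode='group',
                                 title='Loan Status vs Property Area')
# The x-axis shows the property area (Urban, Rural, Semiurban)
# Show the chart
fig_property_area.show()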

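## Data preparation and training the loan approval prediction model

In this step, we will:

- convert the categorical columns to numeric columns;
- split the data into training and test sets;
- scale the numeric features;
- train the loan approval prediction model.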
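For orientation, the four steps listed above can also be bundled into a single scikit-learn `Pipeline` with a `ColumnTransformer`. This is an alternative sketch, not what the notebook does: the cells that follow use `pd.get_dummies` and a manually applied `StandardScaler`. The frame `toy` holds made-up rows that only mimic a few of the dataset's columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Made-up rows that only mimic the dataset's column layout
toy = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006, 12841],
    "Credit_History":  [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "Property_Area":   ["Urban", "Rural", "Urban", "Urban", "Urban",
                        "Urban", "Semiurban", "Semiurban", "Urban", "Rural"],
    "Loan_Status":     ["Y", "N", "Y", "Y", "Y", "Y", "N", "N", "Y", "N"],
})
X = toy.drop("Loan_Status", axis=1)
y = toy["Loan_Status"]

# Steps 1 and 3: encode categorical columns and scale numeric ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["ApplicantIncome", "Credit_History"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Property_Area"]),
])
pipeline = Pipeline([("prep", preprocess), ("svm", SVC(random_state=42))])

# Step 2: split; step 4: fit. The pipeline fits the preprocessing on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)
```

The main design benefit is that preprocessing statistics are learned from the training split only, and the same fitted object can later be applied to new applicants.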

In [136]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 548 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             548 non-null    object 
 1   Married            548 non-null    object 
 2   Dependents         548 non-null    object 
 3   Education          548 non-null    object 
 4   Self_Employed      548 non-null    object 
 5   ApplicantIncome    548 non-null    int64  
 6   CoapplicantIncome  548 non-null    float64
 7   LoanAmount         548 non-null    float64
 8   Loan_Amount_Term   548 non-null    float64
 9   Credit_History     548 non-null    float64
 10  Property_Area      548 non-null    object 
 11  Loan_Status        548 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 55.7+ KB
None


In [137]:
import pandas as pd
# One-hot encode the categorical variables
cat_cols = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
df = pd.get_dummies(df, columns=cat_cols)
print(df)


     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0               5849                0.0       128.0             360.0   
1               4583             1508.0       128.0             360.0   
2               3000                0.0        66.0             360.0   
3               2583             2358.0       120.0             360.0   
4               6000                0.0       141.0             360.0   
..               ...                ...         ...               ...   
609             2900                0.0        71.0             360.0   
610             4106                0.0        40.0             180.0   
611             8072              240.0       253.0             360.0   
612             7583                0.0       187.0             360.0   
613             4583                0.0       133.0             360.0   

     Credit_History Loan_Status  Gender_Female  Gender_Male  Married_No  \
0               1.0           Y          False  
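To make the effect of one-hot encoding concrete, the sketch below runs `pd.get_dummies` on a tiny made-up frame. It also shows the optional `drop_first=True` argument, which the notebook does not use: it drops one indicator per original column, since that indicator is implied by the others.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

# Default: one indicator column per category, as in the cell above
full = pd.get_dummies(toy, columns=["Gender", "Property_Area"])

# drop_first=True drops the first category of each encoded column
reduced = pd.get_dummies(toy, columns=["Gender", "Property_Area"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Whether to drop a level is a modeling choice; keeping all indicators, as the notebook does, is a common default.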

In [138]:
# Prepare the features (x) and the target variable (y)
x = df.drop('Loan_Status',axis=1)
y = df['Loan_Status']
print(x.head())

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0       128.0             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History  Gender_Female  Gender_Male  Married_No  Married_Yes  \
0             1.0          False         True        True        False   
1             1.0          False         True       False         True   
2             1.0          False         True       False         True   
3             1.0          False         True       False         True   
4             1.0          False         True        True        False   

   Dependents_0  Dependents_1  Dependents_2  Dependents_3+  \
0          True         False         False          False   
1   

In [146]:
from sklearn.model_selection import train_test_split
# Split into training and test sets (80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# random_state=42 makes the split identical on every run, so results are reproducible
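Since the target classes are imbalanced, one optional refinement is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits. The notebook itself does not use it; the sketch below demonstrates the effect on made-up labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70% 'Y', 30% 'N'
y = pd.Series(["Y"] * 70 + ["N"] * 30)
X = pd.DataFrame({"feature": range(100)})

# stratify=y preserves the 70/30 ratio in both the training and the test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_test.value_counts())
```

Without `stratify`, a small test set can end up with noticeably more or fewer rejected loans than the full data purely by chance.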

In [148]:
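# Standardize the numeric features
# Create a scaler that rescales each column to mean 0 and standard deviation 1
scaler = StandardScaler()
numerical_cols = ['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']
# Compute the mean and standard deviation on the training set and transform it
x_train[numerical_cols] = scaler.fit_transform(x_train[numerical_cols])
# Transform the test set using the parameters learned from the training set
x_test[numerical_cols] = scaler.transform(x_test[numerical_cols])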
print(x_train.head())

     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
50         -1.178103           0.198547   -0.592052          0.305159   
99         -1.269230           1.502563    0.015114          0.305159   
520        -1.035751           0.265913   -1.669283          0.305159   
357        -0.128256          -0.931554   -1.238390          0.305159   
304        -0.060855           0.786969    0.191388          0.305159   

     Credit_History  Gender_Female  Gender_Male  Married_No  Married_Yes  \
50         0.402248           True        False       False         True   
99         0.402248          False         True       False         True   
520        0.402248          False         True       False         True   
357        0.402248          False         True       False         True   
304        0.402248          False         True        True        False   

     Dependents_0  Dependents_1  Dependents_2  Dependents_3+  \
50           True         False         
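The reason for calling `fit_transform` on the training set but only `transform` on the test set can be checked on a tiny made-up example: the training column ends up with mean 0 and standard deviation 1, while test values are expressed relative to the training statistics and need not have those properties.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up incomes: three training rows, two test rows
train = np.array([[2000.0], [4000.0], [6000.0]])
test = np.array([[4000.0], [10000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training rows only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting the scaler on the full data instead would leak information about the test rows into the training features.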

In [141]:
# Import and initialize the SVM model
from sklearn.svm import SVC
model = SVC(random_state=42)


In [149]:
# Train the model
model.fit(x_train,y_train)
# The algorithm learns the relationship between x_train and y_train and finds the best decision boundary

Make predictions on the test set:

In [158]:
y_pred = model.predict(x_test)
print(y_pred)
print(y_pred.shape)

['Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y'
 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y'
 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y'
 'Y' 'N' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y'
 'Y' 'Y' 'Y' 'N' 'Y' 'N' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y'
 'Y' 'N' 'Y' 'Y' 'N' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'N' 'Y' 'Y'
 'Y' 'Y']
(110,)
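The notebook prints the predictions but does not score them. A natural next step is to compare them with the true labels using scikit-learn's metrics. The sketch below uses short made-up label lists (`y_true_toy`, `y_pred_toy`) in place of `y_test` and `y_pred`, so the numbers say nothing about the actual model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels standing in for y_test and y_pred
y_true_toy = ["Y", "Y", "N", "Y", "N", "Y", "N", "Y"]
y_pred_toy = ["Y", "Y", "Y", "Y", "N", "Y", "Y", "N"]

# Fraction of predictions that match the true label
accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes, in the order given by labels
matrix = confusion_matrix(y_true_toy, y_pred_toy, labels=["N", "Y"])

print(accuracy)
print(matrix)
print(classification_report(y_true_toy, y_pred_toy))
```

On the real split this would be `accuracy_score(y_test, y_pred)`; with imbalanced classes, the per-class precision and recall in the report are usually more informative than accuracy alone.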


Add the predictions to x_test as a new column so they can be viewed alongside the (scaled) features:

In [164]:
# x_test is already a DataFrame; work on a copy so the original stays unchanged
x_test_df = x_test.copy()
# y_pred is a NumPy array in the same row order as x_test, so it can be assigned directly
x_test_df['Loan_Status_Pred'] = y_pred

print(x_test_df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 110 entries, 277 to 337
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ApplicantIncome          110 non-null    float64
 1   CoapplicantIncome        110 non-null    float64
 2   LoanAmount               110 non-null    float64
 3   Loan_Amount_Term         110 non-null    float64
 4   Credit_History           110 non-null    float64
 5   Gender_Female            110 non-null    bool   
 6   Gender_Male              110 non-null    bool   
 7   Married_No               110 non-null    bool   
 8   Married_Yes              110 non-null    bool   
 9   Dependents_0             110 non-null    bool   
 10  Dependents_1             110 non-null    bool   
 11  Dependents_2             110 non-null    bool   
 12  Dependents_3+            110 non-null    bool   
 13  Education_Graduate       110 non-null    bool   
 14  Education_Not Graduate   110 

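In summary, loan approval prediction is a binary classification problem: given an applicant's attributes, the model trained above on historical loan data predicts either Y (approved) or N (not approved) for each new application.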